OrochiDB Documentation

Learn how to use OrochiDB's PostgreSQL HTAP features, including automatic sharding, time-series optimization, columnar storage, and more.

Quick Start Guide

Get started with OrochiDB on HowlerOps in just a few steps.

1. Create an Account

Sign up for a free HowlerOps account at howlerops.com/register. No credit card required.

2. Create Your First Cluster

From the dashboard, click "New Cluster" and configure:

  • Choose a name for your cluster
  • Select your cloud provider (AWS, GCP, or Azure)
  • Pick a region close to your users
  • Choose your cluster tier (Free, Standard, or Professional)

3. Connect to Your Database

Use the connection string from your cluster dashboard to connect with any PostgreSQL client (replace username, password, and your-cluster-host with your own values):

# Using psql
psql "postgresql://username:password@your-cluster-host:5432/dbname"

// Using Node.js
const { Pool } = require('pg');
const pool = new Pool({
  connectionString: 'postgresql://username:password@your-cluster-host:5432/dbname'
});

# Using Python
import psycopg2
conn = psycopg2.connect("postgresql://username:password@your-cluster-host:5432/dbname")

Automatic Sharding

OrochiDB automatically distributes your data across multiple nodes using hash-based sharding. This allows you to scale horizontally without changing your application code.

-- Create a distributed table with automatic sharding
SELECT create_distributed_table('orders', 'customer_id');

-- Or specify the number of shards
SELECT create_distributed_table(
  'orders',
  'customer_id',
  shard_count => 32
);

-- Create a reference table (replicated to all nodes)
SELECT create_reference_table('countries');

Best Practices

Choose your distribution column wisely: Pick a column with high cardinality that's frequently used in WHERE clauses and JOINs.

Use reference tables for lookup data: Small tables that are frequently joined should be reference tables.

Co-locate related tables: Tables that are frequently joined should use the same distribution column.
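
If OrochiDB follows the common Citus-style colocate_with option (an assumption; check your version's function reference for the exact parameter name), co-location can be requested explicitly when distributing a table:

-- Hypothetical: distribute order_items on the same column as orders
-- so that joins on customer_id stay local to each shard
SELECT create_distributed_table(
  'order_items',
  'customer_id',
  colocate_with => 'orders'
);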

Time-Series Optimization

OrochiDB automatically partitions time-series data into chunks for optimal query performance and data lifecycle management.

-- Create a regular table
CREATE TABLE sensor_data (
  time        TIMESTAMPTZ NOT NULL,
  sensor_id   INTEGER,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable(
  'sensor_data',
  'time',
  chunk_time_interval => INTERVAL '1 day'
);

-- Create distributed hypertable
SELECT create_distributed_hypertable(
  'sensor_data',
  'time',
  'sensor_id',
  chunk_time_interval => INTERVAL '1 day'
);
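
Because chunks are bounded by time, queries that filter on the time column only touch the chunks covering that window. A typical rollup over recent data is plain SQL against the sensor_data table defined above:

-- Hourly average temperature per sensor for the last 24 hours;
-- only the chunks overlapping that window are scanned
SELECT sensor_id,
       date_trunc('hour', time) AS hour,
       avg(temperature) AS avg_temp
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY sensor_id, hour
ORDER BY sensor_id, hour;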

Columnar Storage

Store analytical data in column-oriented format with advanced compression for 10x+ faster analytics queries and up to 90% storage savings.

-- Convert existing table to columnar storage
ALTER TABLE analytics_events
SET ACCESS METHOD columnar;

-- Create new columnar table
CREATE TABLE events (
  id BIGINT,
  event_type TEXT,
  payload JSONB,
  created_at TIMESTAMPTZ
) USING columnar;

-- Set compression options
ALTER TABLE events SET (
  columnar.compression = zstd,
  columnar.stripe_row_limit = 150000,
  columnar.chunk_group_row_limit = 10000
);
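
Columnar storage pays off most when queries scan many rows but reference only a few columns, since columns the query never touches can be skipped. For example, against the events table above:

-- Aggregate over event_type and created_at only;
-- the (typically large) payload column is not read
SELECT event_type, count(*) AS event_count
FROM events
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY event_type
ORDER BY event_count DESC;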

Tiered Storage

Automatically move data between storage tiers based on age and access patterns. Reduce costs while maintaining query access to all your data.

-- Configure tiered storage policy
SELECT set_tiered_storage_policy(
  'events',
  hot_to_warm_after => INTERVAL '7 days',
  warm_to_cold_after => INTERVAL '30 days',
  cold_to_frozen_after => INTERVAL '90 days'
);

-- Hot tier: NVMe SSD (fastest, most expensive)
-- Warm tier: SSD (balanced performance/cost)
-- Cold tier: HDD (cost-effective storage)
-- Frozen tier: S3/Object storage (lowest cost)

-- View tier distribution
SELECT * FROM orochi_tier_stats('events');

-- Query across all tiers transparently
SELECT * FROM events
WHERE created_at > NOW() - INTERVAL '1 year';

Vector Search & AI

Build AI-powered applications with native vector storage and SIMD-optimized similarity search operations.

-- Create table with vector column
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  title TEXT,
  content TEXT,
  category TEXT,
  embedding vector(1536)  -- OpenAI ada-002 dimensions
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Insert embeddings from your AI model
INSERT INTO documents (title, content, embedding)
VALUES ('My Document', 'Content here...', '[0.1, 0.2, ...]');

-- Find similar documents; query_embedding stands for a vector value
-- supplied by your application (for example, a bind parameter cast to vector)
SELECT title, content,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;

-- Hybrid search with filters
SELECT * FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> query_embedding
LIMIT 10;

Change Data Capture (CDC)

Stream database changes to external systems in real-time. Perfect for event-driven architectures, data lakes, and search indexes.

-- Create CDC subscription
SELECT create_cdc_subscription(
  'orders_to_kafka',
  source_tables => ARRAY['orders', 'order_items'],
  destination => 'kafka://broker:9092/orders-topic',
  format => 'avro'
);

-- Configure CDC options
ALTER SUBSCRIPTION orders_to_kafka SET (
  include_transaction_ids = true,
  include_timestamps = true,
  batch_size = 1000
);

-- Monitor CDC lag
SELECT * FROM cdc_subscription_stats('orders_to_kafka');

-- Pause/resume subscription
SELECT pause_cdc_subscription('orders_to_kafka');
SELECT resume_cdc_subscription('orders_to_kafka');

Data Pipelines

Ingest data from Kafka, S3, and other sources directly into your database with built-in transformation capabilities.

-- Create Kafka ingestion pipeline
SELECT create_pipeline(
  'kafka_events',
  source => 'kafka://broker:9092/events',
  target_table => 'events',
  format => 'json',
  transform => $$
    SELECT
      data->>'id' AS id,
      data->>'type' AS event_type,
      (data->>'timestamp')::timestamptz AS created_at
    FROM source_data
  $$
);

-- Create S3 ingestion pipeline
SELECT create_pipeline(
  's3_logs',
  source => 's3://my-bucket/logs/*.parquet',
  target_table => 'access_logs',
  format => 'parquet',
  schedule => 'every 1 hour'
);

-- Monitor pipeline status
SELECT * FROM pipeline_stats();

Security & Access Control

Enterprise-grade security with encryption, role-based access control, and audit logging.
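
Because OrochiDB is PostgreSQL-compatible, roles and privileges can be managed with standard SQL. A minimal sketch of a read-only role (the role name and schema below are illustrative):

-- Create a read-only role and grant access to the public schema
CREATE ROLE analyst_readonly LOGIN PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO analyst_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO analyst_readonly;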

Encryption

  • TLS 1.3 for data in transit
  • AES-256 encryption at rest
  • Customer-managed encryption keys (Enterprise)
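
Because OrochiDB speaks the PostgreSQL protocol, you can confirm that the in-transit encryption listed above is active for your session using the standard pg_stat_ssl view:

-- Check whether this connection uses TLS, and which version and cipher
SELECT ssl, version, cipher
FROM pg_stat_ssl
WHERE pid = pg_backend_pid();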

Authentication

  • Database username/password
  • JWT token authentication
  • SAML/SSO integration (Enterprise)
  • IP allowlisting

Compliance

  • SOC 2 Type II certified
  • GDPR compliant
  • HIPAA compliant (Enterprise)
  • Full audit logging

Next Steps