Creating Indexes on 750 Million Rows in PostgreSQL

by: Jerrish Varghese

May 12, 2025

Monitoring PostgreSQL Index Creation Progress on Large Tables

The Challenge: Adding an Index to 750 Million Rows

Database administrators and engineers often face the daunting task of adding new indexes to large tables in production environments. In this real-world case study, we'll walk through the process of creating a critical unique index on a PostgreSQL table containing 750 million records on an AWS RDS instance (db.r6g.4xlarge).

Understanding the Problem

Our large transaction_records table needed a unique index on the id column to improve query performance. For tables of this size, index creation operations are not trivial and can take significant time to complete.

sql
-- Simplified table structure
create table public.transaction_records (
    id uuid not null,
    transaction_id uuid,
    category_id integer not null,
    account_code bigint not null,
    amount double precision not null,
    remaining_amount double precision,
    is_active smallint not null,
    processed_amount double precision default 0,
    fiscal_year bigint,
    branch_code bigint
    -- other metadata columns
);

The Impact of Missing Indexes

When we queried the table without an index on the id column, PostgreSQL had to perform a full sequential scan:

sql
explain select * from transaction_records where id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931';

Result:

text
Gather  (cost=1000.00..20055623.77 rows=1 width=136)
  Workers Planned: 2
  ->  Parallel Seq Scan on transaction_records  (cost=0.00..20054623.67 rows=1 width=136)
        Filter: (id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931'::uuid)

This execution plan reveals a critical performance issue: the database must scan all 750 million rows to find a single record, resulting in:

  • Extremely high query cost (over 20 million)
  • Poor performance for applications
  • Unnecessary resource consumption

Choosing the Right Approach

After a failed attempt to create the index with the standard syntax (which blocked writes to the table for hours), we opted for a less disruptive approach using concurrent index creation:

sql
CREATE UNIQUE INDEX CONCURRENTLY idx_transaction_records_id ON transaction_records(id);

The CONCURRENTLY option allows regular database operations to continue while the index is being built, making it suitable for production environments.
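One caveat worth knowing: if a CONCURRENTLY build fails or is cancelled, PostgreSQL leaves behind an INVALID index that still consumes disk space and write overhead. A quick way to detect and clean one up before retrying (using the index name from our example):

```sql
-- List invalid (failed) indexes on the table
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE indrelid = 'transaction_records'::regclass
  AND NOT indisvalid;

-- Drop the failed build before retrying; DROP INDEX also supports CONCURRENTLY
DROP INDEX CONCURRENTLY IF EXISTS idx_transaction_records_id;
```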

Monitoring Progress: The Key to Sanity

When operations take hours or even days to complete, visibility becomes crucial. PostgreSQL provides tools to monitor long-running index creations through system views.

1. Check the Process Status

First, identify the running process:

sql
SELECT pid,
       query,
       state,
       now() - query_start AS duration
FROM pg_stat_activity
WHERE query LIKE 'CREATE%INDEX%';

2. Monitor Detailed Progress

PostgreSQL offers detailed progress information through the pg_stat_progress_create_index view:

sql
SELECT * FROM pg_stat_progress_create_index;

Sample output:

text
pid                 | 14820
datid               | 16404
datname             | ptx_prod
relid               | 410166
index_relid         | 419340
command             | CREATE INDEX CONCURRENTLY
phase               | waiting for writers before validation
lockers_total       | 18
lockers_done        | 14
current_locker_pid  | 25860
blocks_total        | 0
blocks_done         | 0
tuples_total        | 730737795
tuples_done         | 730737795
partitions_total    | 0
partitions_done     | 0

3. Understanding Progress Phases

The index creation process goes through several phases:

  • Initial build: the table is scanned and the index structure is built (730,737,795 of 730,737,795 tuples processed)
  • Waiting for writers before validation: ensuring all concurrent writes are accounted for (lockers_total: 18, lockers_done: 14)
  • Index validation, scanning index: validating the index structure (1,758,055 of 2,813,624 items processed)
  • Index validation, scanning table: final validation against table data (5,548,280 of 16,248,698 items processed)
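Rather than decoding the raw counters by hand, the two views above can be joined into a single progress readout. This is a convenience sketch; the percentage math simply divides done by total, guarding against division by zero:

```sql
-- One-line progress readout for a running CREATE INDEX
SELECT a.pid,
       p.phase,
       round(100.0 * p.tuples_done / nullif(p.tuples_total, 0), 1) AS pct_tuples,
       round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 1) AS pct_blocks,
       p.lockers_done || '/' || p.lockers_total AS lockers,
       now() - a.query_start AS elapsed
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid;
```

In psql, appending \watch 5 instead of the trailing semicolon re-runs the query every five seconds, which is handy during a multi-hour build.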

Results: The Payoff

After nearly 5 hours, our index creation completed. We could see the immediate impact on query performance:

sql
explain select * from transaction_records where id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931';

New execution plan:

text
Index Scan using idx_transaction_records_id on transaction_records  (cost=0.57..8.59 rows=1 width=136)
  Index Cond: (id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931'::uuid)

The estimated query cost dropped from over 20 million to just 8.59, a dramatic improvement: lookups that previously required a full parallel scan now complete several orders of magnitude faster.
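To confirm real execution time rather than the planner's estimate, the same query can be re-run with EXPLAIN ANALYZE (a quick check, using the same illustrative UUID):

```sql
-- Measure actual runtime and buffer usage, not just estimated cost
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM transaction_records
WHERE id = 'f67a3c21-45e8-412d-b897-d28ab5c7f931';
```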

Handling Naming Conflicts

When creating indexes and constraints, we discovered that index names share a namespace with tables and other relations in the same schema, so a conflict occurs if any object with the same name already exists. If you hit one, use a more distinctive naming convention:

sql
-- Example of using a more distinctive naming pattern
CREATE UNIQUE INDEX CONCURRENTLY idx_tr_2023_id ON transaction_records(id);

Checking for naming conflicts before creating indexes can save time and help avoid errors during the process.
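Because index names live in pg_class alongside tables and other relations, a pre-flight check can be as simple as the following sketch (schema and index name taken from our example):

```sql
-- Does any relation already use the intended index name in this schema?
SELECT c.relname, c.relkind
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relname = 'idx_transaction_records_id';
```

An empty result means the name is free; relkind in the output distinguishes tables (r), indexes (i), sequences (S), and so on.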

Key Takeaways

  1. Use CONCURRENTLY: For production systems, use concurrent index creation to minimize disruption.
  2. Monitor progress: Utilize PostgreSQL system views to track long-running operations.
  3. Understand resource impact: Operations on large tables require significant resources and time.
  4. Check naming conflicts: Ensure index names are unique within your schema.
  5. Be patient: Index creation on large tables takes time, but the performance benefits are worth it.
  6. Communicate timelines: Set realistic expectations with stakeholders about index creation durations.

Configuration Considerations

For large tables, consider optimizing these PostgreSQL parameters:

  • maintenance_work_mem: The memory budget for the index build itself; our setting was 2140275kB (about 2GB)
  • work_mem: Affects per-operation sort memory for regular queries (index build sorts draw on maintenance_work_mem instead)
  • max_parallel_workers_per_gather: Impacts parallel query execution, such as the sequential scan shown earlier; parallel index builds are governed by max_parallel_maintenance_workers
  • effective_io_concurrency: Can improve I/O throughput
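As a sketch of how these settings might be applied for a one-off build (the values are illustrative, not a recommendation for every workload):

```sql
-- Session-level settings for a single large index build
SET maintenance_work_mem = '4GB';          -- memory budget for the build
SET max_parallel_maintenance_workers = 4;  -- parallel B-tree builds (PostgreSQL 11+)

CREATE UNIQUE INDEX CONCURRENTLY idx_transaction_records_id
    ON transaction_records (id);

RESET maintenance_work_mem;
RESET max_parallel_maintenance_workers;
```

Session-level SET avoids touching the instance-wide configuration, which matters on a shared RDS parameter group.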

By understanding how to monitor and manage index creation on large tables, you can make informed decisions when working with massive datasets in PostgreSQL, minimizing downtime while maximizing performance gains.
