Select Data with Pagination and Ordering
Overview
This example demonstrates how to use the Redshift Select Snap with Order by, Limit offset, and Limit rows properties to retrieve paginated results. This approach is essential for efficiently processing large datasets by fetching data in manageable chunks.
Prerequisites
- Amazon Redshift cluster with appropriate access
- Valid Redshift account
- Large table with data requiring pagination
Pipeline Overview
The pipeline uses the Redshift Select Snap to fetch data in pages, controlling the number of rows returned and the starting point for each query. This is particularly useful when working with large datasets that cannot be processed in a single query.
Pagination Configuration
Redshift Select Snap - Settings:
- Schema name: public
- Table name: demo_guest
- Order by: Specify columns for consistent ordering (e.g., inst_dt, id)
- Limit offset: Starting row number (e.g., 0 for first page, 10 for second page)
- Limit rows: Number of rows per page (e.g., 10)
How Pagination Works
The Snap generates SQL queries with the following format:
SELECT * FROM [table]
WHERE [where clause]
ORDER BY [ordering]
LIMIT [limit rows]
OFFSET [limit offset]
Example - First Page:
SELECT * FROM "public"."demo_guest"
ORDER BY "inst_dt", "id"
LIMIT 10
OFFSET 0
Example - Second Page:
SELECT * FROM "public"."demo_guest"
ORDER BY "inst_dt", "id"
LIMIT 10
OFFSET 10
Example - Third Page:
SELECT * FROM "public"."demo_guest"
ORDER BY "inst_dt", "id"
LIMIT 10
OFFSET 20
Implementing Multi-Page Retrieval
To retrieve multiple pages of data:
- Set up the initial query with Limit rows = 10 and Limit offset = 0
- Use a loop or multiple Snap instances to increment the offset
- Continue until the number of returned rows is less than the limit
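The loop described above can be sketched outside SnapLogic against any database that supports LIMIT/OFFSET. This is a minimal illustration, not the Snap's implementation: it uses Python's built-in sqlite3 as a stand-in for Redshift, and the demo_guest schema and row count are invented for the example.

```python
import sqlite3

# Stand-in for a Redshift table; sqlite3 accepts the same LIMIT/OFFSET syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_guest (id INTEGER PRIMARY KEY, inst_dt TEXT)")
conn.executemany(
    "INSERT INTO demo_guest (id, inst_dt) VALUES (?, ?)",
    [(i, f"2024-01-{i % 28 + 1:02d}") for i in range(1, 26)],  # 25 rows
)

PAGE_SIZE = 10  # "Limit rows"
offset = 0      # "Limit offset", incremented after each full page
pages = []

while True:
    rows = conn.execute(
        "SELECT * FROM demo_guest ORDER BY inst_dt, id LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()
    if not rows:
        break
    pages.append(rows)
    if len(rows) < PAGE_SIZE:  # partial page means we have reached the end
        break
    offset += PAGE_SIZE

print([len(p) for p in pages])  # → [10, 10, 5]
```

With 25 rows and a page size of 10, the loop issues three queries and stops on the partial third page, matching the stopping rule in the steps above.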
Alternatively, use pipeline parameters to control pagination:
- Define a page_size parameter for the number of rows per page
- Define a page_number parameter and compute the offset as page_size * page_number
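With zero-based page numbers, the offset is simple arithmetic. A quick check of the page_size * page_number formula (parameter names taken from the list above):

```python
def limit_offset(page_size: int, page_number: int) -> int:
    """Offset for a zero-based page number, per the pipeline-parameter scheme above."""
    return page_size * page_number

# Pages of 10 rows: page 0 starts at row 0, page 1 at row 10, page 2 at row 20.
print([limit_offset(10, n) for n in range(3)])  # → [0, 10, 20]
```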
Ordering for Consistency
The Order by property is essential for pagination because it keeps row order consistent across pages:
- Always specify at least one column for ordering
- Include unique columns (like primary keys) for deterministic ordering
- Use multiple columns if needed for complete ordering
Without consistent ordering, the same row might appear in multiple pages or be skipped entirely if data is modified between queries.
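One way to sanity-check a paging key: when ORDER BY includes a unique tiebreaker, consecutive pages must partition the table with no duplicates and no gaps. A sketch of that check, again using sqlite3 as a stand-in with an illustrative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_guest (id INTEGER PRIMARY KEY, inst_dt TEXT)")
# Every row shares the same inst_dt, so inst_dt alone is not a total order;
# adding the primary key (id) makes the ordering deterministic.
conn.executemany(
    "INSERT INTO demo_guest (id, inst_dt) VALUES (?, ?)",
    [(i, "2024-01-01") for i in range(1, 31)],  # 30 rows
)

seen = []
for offset in range(0, 30, 10):
    page = conn.execute(
        "SELECT id FROM demo_guest ORDER BY inst_dt, id LIMIT 10 OFFSET ?",
        (offset,),
    ).fetchall()
    seen.extend(r[0] for r in page)

# No row duplicated, none skipped: the three pages cover the table exactly.
print(len(seen), len(set(seen)))  # → 30 30
```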
Output Configuration
Additional options to control output:
- Output fields: Select only the columns you need to reduce data transfer (leave blank to return all columns)
- Pass through: Include the original input document in the output under the 'original' key
- Ignore empty result: Skip writing output documents when the query returns no rows
Use Cases
This pattern is useful for:
- Processing large datasets in manageable chunks
- Implementing data export with progress tracking
- Building APIs that return paginated results
- Reducing memory consumption when working with large tables
- Implementing incremental data loads
- Creating reports with page-by-page processing
Best Practices
- Always use Order by with pagination to ensure consistent results
- Choose an appropriate page size based on memory constraints and performance
- Use indexed columns in the ORDER BY clause for better performance
- Consider using a unique column (like primary key) in ORDER BY for deterministic ordering
- Monitor query performance and adjust page size if needed
- Use pipeline parameters for flexible pagination configuration
- Test pagination logic with various dataset sizes