Select Data with Pagination and Ordering

Overview

This example demonstrates how to use the Redshift Select Snap with Order by, Limit offset, and Limit rows properties to retrieve paginated results. This approach is essential for efficiently processing large datasets by fetching data in manageable chunks.

Prerequisites

  • Amazon Redshift cluster with appropriate access
  • Valid Redshift account
  • Large table with data requiring pagination

Pipeline Overview

The pipeline uses the Redshift Select Snap to fetch data in pages, controlling the number of rows returned and the starting point for each query. This is particularly useful when working with large datasets that cannot be processed in a single query.

Pagination Configuration

Redshift Select Snap - Settings:

  • Schema name: public
  • Table name: demo_guest
  • Order by: Specify columns for consistent ordering (e.g., inst_dt, id)
  • Limit offset: Starting row number (e.g., 0 for first page, 10 for second page)
  • Limit rows: Number of rows per page (e.g., 10)

How Pagination Works

The Snap generates SQL queries with the following format:

SELECT * FROM [table]
WHERE [where clause]
ORDER BY [ordering]
LIMIT [limit rows]
OFFSET [limit offset]

Example - First Page:

SELECT * FROM "public"."demo_guest"
ORDER BY "inst_dt", "id"
LIMIT 10
OFFSET 0

Example - Second Page:

SELECT * FROM "public"."demo_guest"
ORDER BY "inst_dt", "id"
LIMIT 10
OFFSET 10

Example - Third Page:

SELECT * FROM "public"."demo_guest"
ORDER BY "inst_dt", "id"
LIMIT 10
OFFSET 20

Implementing Multi-Page Retrieval

To retrieve multiple pages of data:

  1. Set up the initial query with Limit rows = 10 and Limit offset = 0
  2. Use a loop or multiple Snap instances to increment the offset
  3. Continue until the number of returned rows is less than the limit

Alternatively, use pipeline parameters to control pagination:

  • Define page_size parameter for the number of rows
  • Define page_number parameter to calculate offset (page_size * page_number)

Ordering for Consistency

The Order by property is crucial for pagination to ensure consistent results across pages:

  • Always specify at least one column for ordering
  • Include unique columns (like primary keys) for deterministic ordering
  • Use multiple columns if needed for complete ordering
Note:

Without consistent ordering, the same row might appear in multiple pages or be skipped entirely if data is modified between queries.

Output Configuration

Additional options to control output:

  • Output fields: Specify specific columns to reduce data transfer (leave blank for all columns)
  • Pass through: Include original input document in output under 'original' key
  • Ignore empty result: Skip output documents when no results are found

Use Cases

This pattern is useful for:

  • Processing large datasets in manageable chunks
  • Implementing data export with progress tracking
  • Building APIs that return paginated results
  • Reducing memory consumption when working with large tables
  • Implementing incremental data loads
  • Creating reports with page-by-page processing

Best Practices

  • Always use Order by with pagination to ensure consistent results
  • Choose an appropriate page size based on memory constraints and performance
  • Use indexed columns in the ORDER BY clause for better performance
  • Consider using a unique column (like primary key) in ORDER BY for deterministic ordering
  • Monitor query performance and adjust page size if needed
  • Use pipeline parameters for flexible pagination configuration
  • Test pagination logic with various dataset sizes