Common data model
The common data model provides a level of indirection between source and target endpoints.
The AutoSync common data model (CDM) provides a level of indirection between source and target endpoints. It decouples the source schema from the target schema so that AutoSync can handle the data as required by each endpoint. With the CDM, you only need to provide data type mappings between your source or target and the common model. Without the CDM, you would need to provide mappings between a source endpoint and every supported target, or between a target endpoint and every supported source.
The CDM includes:
-
A catalog of supported Common Data Types (CDTs). Source endpoints must provide a mapping from source types to the CDTs and target endpoints must map CDTs to target data types.
-
The CDM Common Data Description (CDD). AutoSync uses the CDD at schema introspection time to:
- Set schema attributes such as scale, precision, and primary key state
- Create a schema
- Detect and resolve schema drift
AutoSync divides ingestion into two phases: schema introspection (and schema drift detection) and data loading. Schema introspection requires the CDTs and CDDs. Data loading involves the Common Data Values (CDVs). When AutoSync loads data to the target, it enforces size mapping to ensure data integrity.
Source endpoints need data transformation functions to convert certain data fields into a CDV for data loading. Target endpoints need transformation functions to convert from the CDV to target values. Source endpoints also need to map source columns into an internal PRIMARYKEY that AutoSync uses to support the Incremental and SCD2 load types. You provide these transformations and mappings in the definitions pipeline. CDM mapping examples shows the expressions used in the SimpleGoogleSheets and SimpleSnowflake pipelines.
The purpose of Common Data Types (CDTs)
The CDTs define the data types that the CDM supports. They provide the reference for mapping data types from source endpoint schemas to the CDM schema and from the CDM schema to target endpoint schemas. This ensures interoperability and data consistency between different source and destination endpoints.
Key aspects of the CDTs include:
-
Standardized Catalog: The CDT serves as a centralized repository or catalog with a predefined set of common data types. These data types are standardized and agreed upon in the CDM framework to ensure consistency and compatibility across different systems and endpoints.
-
Schema Mapping: Source endpoints map their native data types to corresponding CDTs when interacting with the CDM schema. Similarly, target endpoints map CDTs to their respective data types in their destination schemas. This mapping process allows data to be seamlessly translated and synchronized between different systems, regardless of their underlying data structures.
-
Data Type Compatibility: CDTs encompass a wide range of data types commonly encountered in various source systems. This includes basic types such as integers, strings, and floating-point numbers, and more complex types such as dates, timestamps, and binary data. The comprehensive set of data types ensures compatibility and support for diverse data formats and structures.
-
Interoperability: The use of CDTs promotes interoperability between different systems and platforms by establishing a common language for representing and exchanging data. This simplifies data integration and facilitates the seamless transfer of information between source and destination endpoints in the AutoSync environment.
-
Extensibility: While the CDT provides a predefined set of common data types, it is also designed to be extensible, allowing for the addition of custom or specialized data types as needed. This ensures flexibility and adaptability to accommodate unique data requirements and scenarios encountered in specific use cases or industries.
The supported CDTs
AutoSync supports the following CDTs:
├── cdt_char
├── cdt_nchar
├── cdt_datetime
├── cdt_bool
├── cdt_json
└── cdt_string
├── cdt_binary
├── cdt_nstring
├── cdt_float8
│ └── cdt_float4
├── cdt_time
├── cdt_timestamp
│ └── cdt_timestamp_ntz
│ └── cdt_date
└── cdt_bigdecimal
└── cdt_decimal
├── cdt_uint64
└── cdt_int64
├── cdt_uint32
└── cdt_int32
├── cdt_uint24
└── cdt_int24
├── cdt_uint16
└── cdt_int16
├── cdt_uint8
└── cdt_int8
When you define a CDT as null in the mapping, AutoSync routes data through the CDT hierarchy based on
its attributes or characteristics. For example, cdt_date
is a child of
cdt_timestamp
. If a mapping defines cdt_date
as null, AutoSync
routes any data associated with datetime values through the cdt_timestamp
hierarchy
of related data types.
The purpose of the Common Data Description (CDD)
The CDD supports schema introspection and schema drift detection. It serves as a metadata repository for the attributes and the characteristics of data fields in the CDM schema. The CDD specifies the structure and properties of data fields, including, but not limited to, scale, precision, and primary key status.
The CDD includes:
- Schema attributes: Describe the structure and properties of data fields in the CDM schema. For example, the data type, size, precision, scale, primary key, and other relevant metadata.
- Data type parameters: Fixed and parameterized data types. For example, BYTE and INT4 types have fixed sizes. NUMBER and VARCHAR types can have parameters such as precision, scale, or maximum size.
AutoSync uses the CDD for:
Schema introspection: AutoSync retrieves the CDD for information about each data field, including its data type and associated parameters. This information encapsulates the structure of the source schema and enables effective mapping to the CDM schema.
Schema drift detection and resolution: During synchronization, AutoSync compares the current state of the source schema with the information stored in the CDD. This enables it to identify and resolve discrepancies and changes.
Schema creation and maintenance: When creating or updating the CDM schema, AutoSync refers to the CDD to ensure that the schema accurately reflects the source data structure and attributes.
A CDD has the following structure:
[
{
"catalog": "",
"schema": "",
"table": "",
"column": "<colum_name>",
"dataType": "",
"precision": 0,
"scale": null,
"isNullable": true,
"position": 0,
"defaultValue": null,
"isPrimaryKeyColumn": false,
"isLastUpdateColumn": false,
"srcPrecision": 0,
"srcScale": 0,
"mapColumnName": null,
"columnDefinition": "<column_name> datatype(p)",
"primaryKeyColumn": false,
"lastUpdateColumn": false,
"nullable": true,
"srcType": "<dataType>",
"srcColumn": "",
"select_column": "<computed_column_from_selectAs>"
},
...
]
Common Data Values (CDVs)
The CDM framework uses CDVs during the data loading phase of the AutoSync implementation. CDVs offer an intermediary representation of data to simplify transformation and synchronization during data loading. CDVs can be easily manipulated and transferred between source and destination endpoints.
The CDVs enable data to flow from the source to the target:
Intermediary representation: Source endpoints provide data transformation functions to convert raw data from their native format into CDVs before transferring them to the CDM schema.
Standardized format: The standard structure defined by CDM ensures consistency and compatibility across different data sources and destination endpoints, regardless of their native data format or schema.
Data transformation: Source endpoints use data transformation functions to convert raw data fields into CDVs based on predefined rules and mappings. These functions might perform tasks such as data normalization, type conversion, and data cleansing to ensure that the resulting CDVs are accurate and compatible with the CDM schema.
Target endpoint integration: Data in CDV format can be easily transferred and integrated with target endpoints in the CDM ecosystem. Target endpoints provide functions to convert CDVs into their respective data formats or structures, enabling seamless data synchronization and loading.
Flexibility and adaptability: The CDV concept allows for flexibility and adaptability in handling diverse data types and structures. It accommodates a wide range of data formats and can be extended to support custom or specialized data transformations for specific use cases or requirements.
Custom endpoints convert data into CDVs in one of the following ways:
- Use
mapData
andmapValue
parameters to define expressions for data conversion. - Use the
selectAs
parameters to retrieve CDV values directly from the database query.
Target size mapping
The target Common Data Model (CDM) applies size mapping during the data loading process in the AutoSync implementation. It determines how to map the parameterized data types from the source schema to the target schema and conform to the size limit constraints imposed by the target system.
The target size mapping covers common database data types such as: array
,
binary
, boolean
, date
, datetime
,
float
, geography
, number
,
object
, time
, timestamp
types (for example,
timestamp_ltz
, timestamp_ntz
, timestamp_tz), and
varchar
(and its variants).
Size Limits: For each data type category, minimum and maximum size limits define the allowed size
range for data fields of that type in the target schema. Parameters such as precision and scale
express the size limit for numeric data types. Maximum length defines the size limit for
character-based data types, such as varchar
.
Adjustments for Constraints: The specified size limits for each data type category in the CDD might require adjustments. For example, if the source schema size limit exceeds the target schema maximum, the CDD must enforce the target constraints.
Customization for Target Systems: The target size mapping can be customized for different target systems or databases to accommodate unique size constraints or requirements.