Extract Configuration

An extract configuration object. This object defines the way that data is extracted from the source records. The extract definition is applied to a specific topic or file group as defined in the source. Keys are file group names or topic names and values are Extract Settings﻿.
For file based loads, this should match a file_group_name defined in the source.file_groups section. For  based loads this should match a topic defined in Kafka.
Required keys:
None - Extract defaults to JSON for all topics or file_groups unless provided
Example Extract
In the following example, two file groups are defined. The file groups defined in the source section are referenced in the extract section to define how data is extracted from the file group. For Kafka based loads, the topic names are used in the extract section to define extraction if required.
JSON
{
    ...,
    "source": {
        "type": "s3",
        "endpoint": "http://some.endpoint/",
        "bucket": "my_bucket",
        "file_groups": {
            "delimited_user_data": {
                "prefix": "crm/users/",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**.csv",
                "sort_type": "lexicographic"
            },
            "delimited_order_data": {
                "prefix": "crm/orders/",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**.json",
                "sort_type": "lexicographic"
            }
        }
    },
    "extract": {
        "delimited_user_data": {
            "record_type": "delimited",
            "headers": ["id", "username", "last_login_at", null, "first_name", "last_name"],
            "field_delimiter": "|",
            "trim_whitespace": true,
            "skip_blank_lines": true,
            "field_optionally_enclosed_by": "\"",
            "null_strings": ["NULL", "N/A"]
        },
        "delimited_order_data": {
            "record_type": "delimited",
            "num_header_lines": 1,
            "exclude_columns": ["user_id", "order_total"],
            "field_delimiter": ","
        }
    },
    "transform": {
        ...
    }
}
{
    ...,
    "source": {
        "type": "s3",
        "endpoint": "http://some.endpoint/",
        "bucket": "my_bucket",
        "file_groups": {
            "delimited_user_data": {
                "prefix": "crm/users/",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**.csv",
                "sort_type": "lexicographic"
            },
            "delimited_order_data": {
                "prefix": "crm/orders/",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**.json",
                "sort_type": "lexicographic"
            }
        }
    },
    "extract": {
        "delimited_user_data": {
            "record_type": "delimited",
            "headers": ["id", "username", "last_login_at", null, "first_name", "last_name"],
            "field_delimiter": "|",
            "trim_whitespace": true,
            "skip_blank_lines": true,
            "field_optionally_enclosed_by": "\"",
            "null_strings": ["NULL", "N/A"]
        },
        "delimited_order_data": {
            "record_type": "delimited",
            "num_header_lines": 1,
            "exclude_columns": ["user_id", "order_total"],
            "field_delimiter": ","
        }
    },
    "transform": {
        ...
    }
}
﻿
Extract Settings
extract.<file_group_name>.record_type
Similar to Source and Sink config, record_type instructs the Pipeline to instantiate a specific extraction implementation for a given type of source record such as Delimited Records (e.g., CSV, TSV), JSON Records, or Fixed Width Binary Records.
Type:
string
Required:
No
Default:
json
Allowed values:
json: see JSON Extract﻿ for additional configuration.
delimited: see Delimited Extract﻿ for additional configuration.
fixed_width_binary: see Fixed Width Binary Extract﻿ for additional configuration.
Shared Extract Configuration
extract.<file_group_name>.null_strings
null_strings instructs the extraction process to convert a string literal such as NA to a null value in the transformation layer. This allows different null string values to be provided. null_strings must be an array of strings. If trim_whitespace is set to true and a string value is trimmed to become a null_string, that value converts to null.
Type:
string[]
Required:
No
Default:
﻿
extract.<file_group_name>.empty_as_null
The empty_as_null setting specifies whether empty fields should be treated as null. An empty field is defined as a string field containing no characters (its length is 0). When set to false, empty fields will be treated literally as empty strings. When set to true, empty fields will be treated as null in the transformation and loading process. If trim_whitespace is set to true and empty_as_null is set to true, a field that consists of only whitespace will also be converted to null.
Type:
boolean
Required:
No
Default:
True for the Delimited and FWB extract types, False for a JSON extract type
extract.<file_group_name>.trim_whitespace
The trim_whitespace setting specifies whether whitespace should be trimmed from the beginning and end of each string field as it is extracted.
Type:
boolean
Required:
No
Default:
False
extract.<file_group_name>.encoding
The encoding setting specifies the file encoding to use. Encodings should use standard  encodings defined in the Java Internationalization Guide.
Type:
string
Required:
No
Default:
UTF8
extract.<file_group_name>.replace_invalid_characters
replace_invalid_characters defines whether invalid characters based on the encoding type should be replaced with a replacement character (i.e. U+FFFD). If this setting is false and invalid characters are encountered, the record will not be loaded and a record extraction error will be logged.
Type:
boolean
Required:
No
Default:
False
JSON Extract
The JSON Extract type allows LAT to extract JSON data from the source records. There are no additional configuration settings available for JSON beyond the common Extract settings.
Example JSON Extract Configuration
In this example JSON extract configuration, the replace_invalid_characters setting is used to convert any invalid UTF-8 characters with the UTF-8 replacement character. The file group name assigned to this file group was defined as "json_user_data" in the source section of the pipelines. When transforming this data, you can reference columns by the JSON object keys.
JSON
{
    "extract": {
        "json_user_data" : {
            "record_type": "json",
            "replace_invalid_characters": true
        }

    },
    ...
}
{
    "extract": {
        "json_user_data" : {
            "record_type": "json",
            "replace_invalid_characters": true
        }

    },
    ...
}
﻿
Delimited Extract
The Delimited Extract type allows the LAT to extract delimited data from the source records. The delimiter character is configurable to support CSV, TSV, or other delimited formats like pipe and semicolon. Fixed width files are a separate extraction type.
Notes on Delimited Files
Delimited files can have additional comment rows that are prefixed by a # character. These lines are not processed and will be skipped during loading.
A maximum number of 1024 columns can be loaded in a single delimited file.
You can enclose fields with an optional enclosure character. In these cases, the enclosure character is not retained in the loaded data.
You can skip a configurable number of header lines during loading to prevent column headers from appearing as loaded data.
The system parses headers from the beginning of each file if you configure num_header_lines﻿ and do not configure headers﻿.
The system automatically names columns if you do not configure  ﻿num_header_lines﻿ and headers﻿.  The columns are named "$0", "$1", and so on consecutively instead.
You can define multiple file groups. This allows different delimited extraction settings on each file group.
The default record delimiter is a \n newline character. If source data differs, this should be overridden.
Records with improperly escaped "quote characters" or malformed "quoting" of column data will not parse. See the  ﻿num_header_lines﻿  setting for more details.
Control characters such as the NULL byte 0x0 are automatically processed as whitespace when parsing delimited records.
Delimited Configuration
extract.<file_group_name>.field_delimiter
The field_delimiter specifies the string that should function as a column delimiter. The field_delimiter character applies to any unquoted and unescaped text when extracting records from the source file. This can be a single character or a multi-character string delimiter. Only one field_delimiter string is allowed per extract.
Type:
string
Required:
No
Default:
,
extract.<file_group_name>.record_delimiter
The record_delimiter specifies the string that indicates the end of a record. The record_delimiter character applies to any unquoted and unescaped text when extracting records from the source file. This can be a single character or a two-character string delimiter. Only one record_delimiter string is allowed per extract.
The default is a standard newline character \n. Depending on the file, typical settings for this delimiter are a newline \n, a carriage return \r or a carriage return and newline \r\n.
Type:
string; maximum 2 characters
Required:
No
Default:
\n
extract.<file_group_name>.field_optionally_enclosed_by
Also referred to as the "quote character," the field_optionally_enclosed_by settings specifies the character that is optionally used to enclose a field or column. The data between enclosure characters can contain delimiter characters. The enclosure character itself must be escaped inside of an enclosed field. Common values are " to enclose fields by double quotes. Because this setting defines an optional enclosure, not all fields in the record need to be enclosed. The escape character used to escape the enclosure character is separately configured and defaults to " per the RFC-4180 standard.
Type:
string; maximum 1 character
Required:
No
Default:
"
Example:
In the following pipe delimited record, the double quote character can be specified so that the final column will be this field has a | in it when loaded into the database. If no enclosure character is specified, this record would yield four columns instead of three.
Text
a|b|"this field has a | in it"
a|b|"this field has a | in it"
﻿
extract.<file_group_name>.escape
The escape settings specifies the character that is used as an escape occurrences of the enclosing character used to value so that the following character is escaped. This is typically used in enclosed columns to escape the enclosure character defined as field_optionally_enclosed_by.
Type:
string; maximum 1 character
Required:
No
Default:
"
extract.<file_group_name>.num_header_lines
The num_header_lines setting defines the number of header lines in the source. The system reads these header lines but does not load them as data.
If you do not specify headers﻿, the system uses the final line within this section to name the columns in each file. Then, the system uses these names in the Transform section to refer to column values. The system names columns for each file, so the names available in the Transform section might change if the file header changes.
This setting only applies to file-based source types.
Type:
int
Required:
No
Default:
0
extract.<file_group_name>.headers
The headers setting defines the header labels associated with each column in a delimited file. This array of values is associated with the columns in order from left to right. The strings assigned to the column names are used in the Transform section to refer to column values. Any explicitly supplied headers that are an empty string "" or a null literal will be skipped during extraction. This can have performance benefits if all columns are not needed.
Type:
string[]
Required:
No
Default:
null
extract.<file_group_name>.include_columns
The include_columns setting restricts extraction to the columns with the specified names. If you do not need all columns during transformation, you might improve performance when you use the include_columns setting.
If there are no headers and you specify this setting, the system names columns based on their original positions before other columns are filtered out. For example, if there are four columns and $0 and $3 are included, the system names these columns $0 and $3 during transformation, not $0 and $1.
If you specify the exclude_columns setting, you cannot specify this setting.
Type:
string[]
Required:
No
Default:
null
extract.<file_group_name>.exclude_columns
The exclude_columns setting causes the LAT to not extract any of the specified columns. If you do not need all columns during transformation, you might improve performance when you use the exclude_columns setting.
If there are no headers and you specify this setting, the system names columns based on their original positions before other columns are filtered out. For example, if there are four columns and $1 and $2 are excluded, the system names these columns $0 and $3 during transformation, not $0 and $1.
If you specify the include_columns setting, you cannot specify this setting.
Type:
string[]
Required:
No
Default:
null
extract.<file_group_name>.skip_blank_lines
The skip_blank_lines setting specifies whether lines with only whitespace should be skipped or processed.
Type:
boolean
Required:
No
Default:
False
Example Delimited Extract Configuration
In this example extract configuration, a pipe delimited file type with five column names is provided. The file group name assigned to this file group was defined as "delimited_user_data" in the source section of the pipelines. Whitespace will be trimmed from each field and blank lines will be skipped instead of reporting errors. Fields with special characters can be escaped by ". Any literal values of NULL or N/A are converted to null. When transforming this data, columns can be referred to by id, username, last_login_at, first_name, and last_name. The fourth column is listed as null in the header, so it will be skipped during extraction. Because num_header_lines is not specified, the files are assumed to have no headers.
JSON
{
    "extract": {
        "delimited_user_data": {
            "record_type": "delimited",
            "headers": ["id", "username", "last_login_at", null, "first_name", "last_name"],
            "field_delimiter": "|",
            "trim_whitespace": true,
            "skip_blank_lines": true,
            "include_columns": ["id", "username", "last_name"],
            "field_optionally_enclosed_by": "\"",
            "null_strings": ["NULL", "N/A"]
        },
        ...
    },
    ...
}
{
    "extract": {
        "delimited_user_data": {
            "record_type": "delimited",
            "headers": ["id", "username", "last_login_at", null, "first_name", "last_name"],
            "field_delimiter": "|",
            "trim_whitespace": true,
            "skip_blank_lines": true,
            "include_columns": ["id", "username", "last_name"],
            "field_optionally_enclosed_by": "\"",
            "null_strings": ["NULL", "N/A"]
        },
        ...
    },
    ...
}
﻿
Fixed Width Binary Extract
The Fixed Width Binary (FWB) Extract type allows the LAT to extract data from binary source records that have a fixed number of bytes per column. FWB source records typically come from sources like mainframe systems like the  series that use Cobol Copybooks to produce character, integer, floating point, and decimal data in various byte formats. The FWB Extract type requires users to specify the byte offset, width in bytes, and the type of extraction to use on the indicated bytes for each column. The FWB type also allows users to specify a record length in bytes rather than a character record delimiter. FWB files are typically extracted using encodings such as cp500 and cp037.
Fixed Width Binary Data Types
The following data types are supported when extracting Fixed Width Binary data. All data types require specification of offset and width to locate the bytes to extract in each record.
Data Type 
Description 
Supported Settings 
string 
Represents bytes of character data following the encoding defined in the extract settings. Bytes are extracted into a string for transformation and loading. 
encoding - inherited 
integer 
Represents a signed integer. Uses the endianness setting to determine how bytes should be interpreted. The integer type can be used for 1, 2, 4, or 8 bytes of data. These correspond to mainframe concepts of a Byte, Half-word, Full-word, and Double-word. 
endianness - inherited 
unsigned_integer 
Represents a unsigned integer. Uses the endianness setting to determine how bytes should be interpreted. Can be used for 1, 2, 4, or 8 bytes of data. The largest integer data type in the  system is a signed BIGINT, so the maximum value for an unsigned_integer is 2^63-1. 
endianness - inherited 
float 
Represents a floating point number as defined by the -754 floating point specification. Can be used for 4 or 8 bytes of data. 4 bytes represents a single precision floating point number, and 8 bytes represents a double precision floating point number. 
﻿
 
packed_decimal 
A Packed Binary Coded Decimal format for decimal precision that packs two integral values in each byte and encodes the sign in the final byte. The number of decimal points represented by the data is supplied with a separate extraction setting. The sign is encoded in the 4 least significant bits of the final byte. A value of 0xD or 0xB indicates a negative sign. Any other value including the common values of 0xF, 0xE, 0xC or 0xA indicating positive or unsigned. 
endianness - inherited 
decimals - number of digits to place the right of the decimal point 
sign_format - defaults to 'trailing' byte format indicating that the sign is encoded in the most significant four bits of the final byte; provided for future extensibility. 
zoned_decimal 
A Zoned Binary Coded Decimal format for decimal precision that packs two integral values in each byte and encodes the sign in the final byte. The number of decimal points represented by the data is supplied with a separate extraction setting. The sign is encoded in the 4 most significant bits of the final byte. A value of 0xD, 0xB, 0x7 indicates a negative sign. Values of 0xF, 0xE, 0xC or 0xA indicate positive or unsigned. 
endianness - inherited 
decimals - number of digits to place the right of the decimal point 
sign_format - defaults to 'trailing' byte format indicating that the sign is encoded in the most significant four bits of the final byte; provided for future extensibility. 
Fixed Width Binary Configuration
extract.<file_group_name>.record_width
The record_width setting specifies the total number of bytes in each record. Files are read serially, and the next record is assumed to start on the following byte. The record_width can be greater than the number of bytes used in extracting the individual columns.
Type:
integer
Required:
Yes
Default:
﻿
extract.<file_group_name>.padding_character
The padding_character setting specifies a character in a fixed width file type that should be removed from extracted character data. After extracting a character field, any leading or trailing occurrences of this character will be stripped from the ends of the character data. Typical examples are the space character or an underscore.
Type:
string
Required:
No
Default:
" " (in the specified encoding)
extract.<file_group_name>.endianness
The endianness setting specifies whether the extraction of numeric fields should use Big Endian or Little Endian byte ordering to process extracted bytes. endianness impacts integer, packed_decimal, and zoned_decimal column types.
Type:
string
Required:
No
Default:
big
Allowed values:
big: Big Endian
little: Little Endian
extract.<file_group_name>.columns
A extract columns configuration object. A collection of columns and their associated extraction configuration. Keys are column names and values are Column Extracts configuration﻿. 
Required keys:
﻿extract.columns.<columns> ﻿ 
Column Extracts
Column extract configurations for Fixed Width Binary records.
extract.<file_group_name>.columns.<column>
A column extract configuration object keyed by a column name. A column’s value is defined as an extract definition. The extract definition will indicate the offset and width to locate the bytes to extract, the data type to use in the extraction, and any additional parameters about the bytes to extract.
Required keys:
offset
width
data_type
Optional keys:
decimals
sign_format
extract.<file_group_name>.columns.<column>.offset
Number of bytes to skip from the beginning of the record before beginning to extract bytes for this column. Note that offset is relative to the beginning of the record, so arbitrary offsets can be used including overlapping sets of bytes with other columns.
Type:
integer
Required:
Yes
Default:
﻿
extract.<file_group_name>.columns.<column>.width
Number of bytes to extract from the offset location.
Type:
integer
Required:
Yes
Default:
﻿
extract.<file_group_name>.columns.<column>.data_type
Data type to extract from the bytes that are specified by offset and width. See Fixed Width Binary Data Types﻿ for reference on available data types.
Type:
string
Required:
Yes
Default:
string
Allowed values:
string
integer
unsigned_integer
float
packed_decimal
zoned_decimal
Note that each data_type has specific settings that are allowed to control the extraction to that data_type. Some of these are inherited from the overall extract configuration, while others are defined on each column.
extract.<file_group_name>.columns.<column>.decimals
The number of numeric values that are placed to the right of the decimal point in a zoned_decimal or packed_decimal data type. For example, if decimals is set to 2` and the data in a zoned_decimal represents the numerals 12345, this would create the decimal 123.45` in the extracted value.
decimals configures the scale of the extracted decimal and the total number of digits present in the source data determines the precision. If insufficient scale or precision exists in the target Ocient column, data will fail to load.
Type:
integer
Required:
No
Default:
0
Applies to:
packed_decimal
zoned_decimal
extract.<file_group_name>.columns.<column>.sign_format
In decimal types, the sign for the number can be encoded in different ways. Currently, only the trailing format is supported which requires the sign to be encoded in 4 bits of the last byte in the column data. This setting is provided for future extensibility. The values that indicate a positive or negative sign are listed in Fixed Width Binary Data Types﻿.
Type:
string
Required:
No
Default:
trailing
Allowed values:
trailing - the sign is encoded in the most significant 4 bits of the last byte in the binary coded decimal.
Applies to:
packed_decimal
zoned_decimal
Fixed Width Binary Extract Example
The following example illustrates extracting fixed width binary data from a file group that was given the name "order_files" in the source section of the pipeline. The file has 3100 bytes per record and is only extracting a few columns.
JSON
{
    "extract": {
        "order_files": {
            "record_type": "fixed_width_binary",
            "record_width": 3100,
            "endianness": "big",
            "columns": {
                "order_id": {
                    "offset": 0,
                    "width": 4,
                    "data_type": "integer"
                },
                "username": {
                    "offset": 4,
                    "width": 20,
                    "data_type": "string"
                },
                "order_total": {
                    "offset": 24,
                    "width": 10,
                    "data_type": "packed_decimal",
                    "decimals": 2
                },
                "sales_tax_total": {
                    "offset": 34,
                    "width": 8,
                    "data_type": "zoned_decimal",
                    "decimals": 3
                }
            }
        }
    }
}
{
    "extract": {
        "order_files": {
            "record_type": "fixed_width_binary",
            "record_width": 3100,
            "endianness": "big",
            "columns": {
                "order_id": {
                    "offset": 0,
                    "width": 4,
                    "data_type": "integer"
                },
                "username": {
                    "offset": 4,
                    "width": 20,
                    "data_type": "string"
                },
                "order_total": {
                    "offset": 24,
                    "width": 10,
                    "data_type": "packed_decimal",
                    "decimals": 2
                },
                "sales_tax_total": {
                    "offset": 34,
                    "width": 8,
                    "data_type": "zoned_decimal",
                    "decimals": 3
                }
            }
        }
    }
}
﻿
Related Links
﻿Load Data﻿﻿

Type:	string
Required:	No
Default:	json

Type:	string[]
Required:	No
Default:

Type:	boolean
Required:	No
Default:	True for the Delimited and FWB extract types, False for a JSON extract type

Type:	boolean
Required:	No
Default:	False

Type:	string
Required:	No
Default:	UTF8

Type:	boolean
Required:	No
Default:	False

Type:	string
Required:	No
Default:	,

Type:	string; maximum 2 characters
Required:	No
Default:	\n

Type:	string; maximum 1 character
Required:	No
Default:	"

Type:	string; maximum 1 character
Required:	No
Default:	"

Type:	int
Required:	No
Default:	0

Type:	string[]
Required:	No
Default:	null

Type:	string[]
Required:	No
Default:	null

Type:	string[]
Required:	No
Default:	null

Type:	boolean
Required:	No
Default:	False

Data Type	Description	Supported Settings
string	Represents bytes of character data following the encoding defined in the extract settings. Bytes are extracted into a string for transformation and loading.	encoding - inherited
integer	Represents a signed integer. Uses the endianness setting to determine how bytes should be interpreted. The integer type can be used for 1, 2, 4, or 8 bytes of data. These correspond to mainframe concepts of a Byte, Half-word, Full-word, and Double-word.	endianness - inherited
unsigned_integer	Represents a unsigned integer. Uses the endianness setting to determine how bytes should be interpreted. Can be used for 1, 2, 4, or 8 bytes of data. The largest integer data type in the system is a signed BIGINT, so the maximum value for an unsigned_integer is 2^63-1.	endianness - inherited
float	Represents a floating point number as defined by the -754 floating point specification. Can be used for 4 or 8 bytes of data. 4 bytes represents a single precision floating point number, and 8 bytes represents a double precision floating point number.
packed_decimal	A Packed Binary Coded Decimal format for decimal precision that packs two integral values in each byte and encodes the sign in the final byte. The number of decimal points represented by the data is supplied with a separate extraction setting. The sign is encoded in the 4 least significant bits of the final byte. A value of 0xD or 0xB indicates a negative sign. Any other value including the common values of 0xF, 0xE, 0xC or 0xA indicating positive or unsigned.	endianness - inherited decimals - number of digits to place the right of the decimal point sign_format - defaults to 'trailing' byte format indicating that the sign is encoded in the most significant four bits of the final byte; provided for future extensibility.
zoned_decimal	A Zoned Binary Coded Decimal format for decimal precision that packs two integral values in each byte and encodes the sign in the final byte. The number of decimal points represented by the data is supplied with a separate extraction setting. The sign is encoded in the 4 most significant bits of the final byte. A value of 0xD, 0xB, 0x7 indicates a negative sign. Values of 0xF, 0xE, 0xC or 0xA indicate positive or unsigned.	endianness - inherited decimals - number of digits to place the right of the decimal point sign_format - defaults to 'trailing' byte format indicating that the sign is encoded in the most significant four bits of the final byte; provided for future extensibility.

Type:	integer
Required:	Yes
Default:

Type:	string
Required:	No
Default:	" " (in the specified encoding)

Type:	string
Required:	No
Default:	big

Type:	integer
Required:	Yes
Default:

Type:	integer
Required:	Yes
Default:

Type:	string
Required:	Yes
Default:	string

Type:	integer
Required:	No
Default:	0

Type:	string
Required:	No
Default:	trailing

Updated 31 Jan 2024

Did this page help you?

Sink Configuration

Transform Configuration