# LAT Source Configuration
Data pipelines are now the preferred method for loading data into the Ocient system. For details, see docid\ xq0tg7yph vn62uwufibu.

## `source`

A source configuration object.

- Type: object
- Required: Yes

### Required Keys

#### `source.type`

Type of source to use in the pipeline.

- Type: string
- Required: Yes
- Allowed values:
  - `kafka` — see Kafka Configuration for additional configuration.
  - `s3` — see Load from a File Source and S3 Source-Specific Configuration for additional configuration.
  - `local` — see Load from a File Source for additional configuration.

## Load from a File Source

Currently, the LAT supports loading files from an S3 instance (such as AWS S3 or IBM Cloud Object Storage) or from a local file system. Configuration common to all file sources is listed in Common File Source Configuration, followed by sections describing source-specific configuration.

File sources are defined by creating "file groups" that represent a logical set of files to be loaded. Each file group is given a file group name that corresponds to the file group name in the `transform` section of the pipeline configuration. Each file group has settings to select the specific files that should be loaded.

The LAT supports loading from individual files with extensions such as `.jsonl` and `.csv`. GZIP compression is supported. However, files using the tar archive format, LZOP compression, or ZIP compression are not supported. If you load files with unsupported archive or compression formats, the load might stop or produce unexpected results.

The performance of loading into Ocient is greatly improved when records are presented in a well-ordered time sequence. This sequence allows more efficient creation of segments and sorting of records into TIMEKEY buckets. For this reason, the LAT has options that define how files in a file group should be sorted prior to loading.

### File Group Filtering and Sorting Example

If you want to load all the files under a certain directory, the file group configuration is basic:

```json
"my_file_group": {
  "prefix": "/dir1",
  "file_matcher_syntax": "glob",
  "file_matcher_pattern": "**",
  "sort_type": "lexicographic"
}
```

This file group configuration selects all files under `/dir1` (including those under its subdirectories) and sorts them lexicographically. However, if you need to selectively choose, filter, and sort the selected files, the LAT provides flexible options for doing so. For an individual file group in a file-type load, the selection, renaming, filtering, and sorting of the retrieved files occur through this sequence:

1. Prefix filtering
2. File matching
3. File renaming
4. Timestamp extract pattern matching (`extract_timestamp` only)
5. Range matching
6. Final list sorting

For example, given these files from a local file system:

```
/dir1/file.json
/dir1/files1/year=2021/month=01/day=01/files2/10-00-00.csv
/dir1/files1/year=2021/month=01/day=01/file.json
/dir1/files1/year=2021/month=01/day=01/files3/11-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/12-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/14-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/13-00-00.json
```

and a file group configuration:

```json
"my_group": {
  "prefix": "/dir1/files1/",
  "file_matcher_syntax": "regex",
  "file_matcher_pattern": "/dir1/files1/year=(\\d{4})/month=(\\d{2})/day=(\\d{2})/files\\d/(\\d{2})-(\\d{2})-(\\d{2})\\.json",
  "rename_format": "{1}-{2}-{3}-{4}-{5}-{6}",
  "sort_type": "extract_timestamp",
  "path_timestamp_pattern": "yyyy-MM-dd-HH-mm-ss",
  "start_time": "2021-01-01T13:00:00",
  "stop_time": "2021-01-01T14:00:01"
}
```

#### Step 1: Prefix Filtering

Prefix filtering occurs first and includes only files that are in the matching prefix paths. The prefix is the path part following the bucket for `s3` types, and it is the path to the files in a `local` file-type load.

```json
"prefix": "/dir1/files1/"
```

Result: `/dir1/file.json` is filtered out because it is not under the prefix.

#### Step 2: File Matching

Next, file matching uses a pattern to match the files that should be included. File matcher patterns apply to the full path of the file, including any prefix defined in the prior step.

```json
"file_matcher_syntax": "regex",
"file_matcher_pattern": "/dir1/files1/year=(\\d{4})/month=(\\d{2})/day=(\\d{2})/files\\d/(\\d{2})-(\\d{2})-(\\d{2})\\.json"
```

Result: `/dir1/files1/year=2021/month=01/day=01/files2/10-00-00.csv` is filtered out because it does not end in `.json`. `/dir1/files1/year=2021/month=01/day=01/file.json` is filtered out because it is not in a matching `files<N>` subdirectory.

#### Step 3: File Renaming

File renaming is a step that can be useful when the selected files have disparate file names that would make it difficult to extract timestamps from them. Files are only renamed internally to the LAT to facilitate the file selection process; they are not actually renamed locally or on S3. By unifying each file name into a consistent format, this step makes it easier to use the `extract_timestamp` sort type, or even to lexicographically sort on some parameter within the file name itself.

```json
"file_matcher_pattern": "/dir1/files1/year=(\\d{4})/month=(\\d{2})/day=(\\d{2})/files\\d/(\\d{2})-(\\d{2})-(\\d{2})\\.json",
"rename_format": "{1}-{2}-{3}-{4}-{5}-{6}"
```

Result: the files that match the `file_matcher_pattern` regular expression are renamed using the `rename_format`. Current files at this step:

```
/dir1/files1/year=2021/month=01/day=01/files3/11-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/12-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/14-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/13-00-00.json
```

Renamed file list that enters the next step:

```
2021-01-01-11-00-00
2021-01-01-12-00-00
2021-01-01-14-00-00
2021-01-01-13-00-00
```

#### Step 4: Timestamp Extract Pattern Matching

A filter is applied when the sort type is `extract_timestamp`. The `extract_timestamp` sort type extracts timestamps from the file's filepath. Either a `path_timestamp_pattern` or a `file_timestamp_pattern` must be set. If a filename does not match the set pattern, it is filtered out.

```json
"sort_type": "extract_timestamp",
"path_timestamp_pattern": "yyyy-MM-dd-HH-mm-ss"
```

Result: in this example, a `path_timestamp_pattern` is set, which is a DateTimeFormatter (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) pattern that extracts a timestamp starting from the beginning of the filename, used for the next step. No files are filtered here because the rename step above unified the filename formats in a way that the `path_timestamp_pattern` could match on all the files. Internal to the LAT, the files are associated with their datetimes, which are used in the next step.

| Filename | Datetime extracted (ISO 8601) |
| --- | --- |
| 2021-01-01-11-00-00 | 2021-01-01T11:00:00Z |
| 2021-01-01-12-00-00 | 2021-01-01T12:00:00Z |
| 2021-01-01-14-00-00 | 2021-01-01T14:00:00Z |
| 2021-01-01-13-00-00 | 2021-01-01T13:00:00Z |

There are potential pitfalls with using the `extract_timestamp` sort type, described in the docid\ z8gjws65x2ybq02bns gm section.

#### Step 5: Range Matching

Range matching occurs next, based on the chosen sort type algorithm. For `extract_timestamp` or `metadata`, a timestamp is associated with each file for sorting; a start and stop time can optionally be provided to limit the files that are selected. For `lexicographic`, the filename itself is used for sorting; a start and stop file name can be provided to limit the files that are selected.

```json
"sort_type": "extract_timestamp",
"path_timestamp_pattern": "yyyy-MM-dd-HH-mm-ss",
"start_time": "2021-01-01T13:00:00",
"stop_time": "2021-01-01T14:00:01"
```

Result: the files `2021-01-01-11-00-00` and `2021-01-01-12-00-00` are filtered out because they are not within the start and stop time range.

#### Step 6: Final List Sorting

Finally, the fully filtered file list contains two files from the original list. These are then sorted according to the sort type algorithm and partitioned across the workers for loading.

Remaining files before sorting:

```
2021-01-01-14-00-00
2021-01-01-13-00-00
```

Final list (sorted, renamed files):

```
2021-01-01-13-00-00
2021-01-01-14-00-00
```

Original file names:

```
/dir1/files1/year=2021/month=01/day=01/files2/13-00-00.json
/dir1/files1/year=2021/month=01/day=01/files2/14-00-00.json
```

### Example S3 Source Configuration

The names defined for file groups (`my_file_group` in this example) should match the file groups used in the `transform` configuration section.

```json
{
  "source": {
    "type": "s3",
    "endpoint": "http://some-endpoint/",
    "bucket": "my-bucket",
    "file_groups": {
      "my_file_group": {
        "prefix": "some/prefix/",
        "file_matcher_syntax": "glob",
        "file_matcher_pattern": "**.json",
        "sort_type": "extract_timestamp",
        "path_timestamp_pattern": "'dir1/files1/'yyyy'-'MM'-'dd'T'HH'-'mm'-'ss"
      }
    },
    "compression": "gzip"
  },
  "transform": {
    "file_groups": {
      "my_file_group": {
        "tables": { }
      }
    }
  }
}
```

### Example Local File Source Configuration

The names defined for file groups (`my_file_group` in this example) should match the file groups used in the `transform` configuration section.

```json
{
  "source": {
    "type": "local",
    "file_groups": {
      "my_file_group": {
        "prefix": "/path/to/data/directory/",
        "file_matcher_syntax": "glob",
        "file_matcher_pattern": "**.json",
        "sort_type": "extract_timestamp",
        "path_timestamp_pattern": "'dir1/files1/'yyyy'-'MM'-'dd'T'HH'-'mm'-'ss"
      }
    },
    "compression": "gzip"
  },
  "transform": {
    "file_groups": {
      "my_file_group": {
        "tables": { }
      }
    }
  }
}
```
### Common File Source Configuration

#### `source.file_groups`

An object that maps file group names to their corresponding configuration objects.

- Type: object
- Required: Yes

#### `source.file_groups.<file_group_name>`

The configuration object for a file group.

- Type: object
- Required: Yes

#### `source.file_groups.<file_group_name>.prefix`

For an `s3` source, the prefix is used to get a subset of S3 objects from the bucket. A more specific prefix can improve the time required to list files in S3 sources. For a `local` file source, the prefix is an absolute or relative path from the working directory of the LAT. In either case, the path can be a file or a directory. If the path is a directory, the LAT tries to load all files recursively under this path. When looking for files, the LAT follows symbolic links, and only loads regular files that are readable and not hidden. Note that shell-specific expansions are not supported.

You can include multiple prefixes for a single file group by specifying this property as an array of strings. In that case, the LAT loads all files that are under any of the prefixes in that array. Files that match more than one prefix within a single file group are loaded only once.

For a local file source, the path used for the prefix is the path on the server where the LAT is running, which might not be the same machine where you run the LAT client command-line interface to create your pipeline.

- Type: string or string array
- Required: No
- Default: `""`

#### `source.file_groups.<file_group_name>.file_matcher_syntax`

Defines the type of syntax used by the `file_matcher_pattern` setting, which selects the files to include in the file group. The pattern is applied to the fully qualified file name in the list of files found under the prefix. See the `file_matcher_pattern` description for examples and complete syntax details for the `glob` and `regex` options.

- Type: string
- Required: No; however, it is not valid to provide only one of `file_matcher_syntax` and `file_matcher_pattern`.
- Default: `glob`
- Allowed values:
  - `glob` — a simplified pattern matching system based on wildcards.
  - `regex` — a regular expression syntax.

#### `source.file_groups.<file_group_name>.file_matcher_pattern`

The pattern used to select files from the prefix-filtered list. Files that match the pattern are included in the file group; the fully qualified file name (path and filename under the prefix) is matched. Files that do not match are excluded from the file group.

This pattern is also used for LAT file renaming. If a `rename_format` is provided, the LAT attempts to internally rename files using capture groups in the `file_matcher_pattern`. If capture groups are provided in the `file_matcher_pattern`, they are extracted and placed into the `rename_format` in order to generate a renamed file. Capture groups do not need to be named; the captured values are assigned to the `rename_format` sequentially, that is, the first capture group in the `file_matcher_pattern` is placed in `{1}` in the `rename_format`. Files are only renamed internally to the LAT to facilitate the file selection process; they are not actually renamed locally or on S3.

This matched file list can be further filtered using an extract timestamp pattern and the start/stop ranges of any file group sort type algorithm. See the expanded example in the File Group Filtering and Sorting Example section.

A double backslash within the regular expression is necessary to generate valid escape characters for the pipeline JSON. For example, the capture group `(\d{4})` needs to be escaped as `(\\d{4})`.

- Type: string
- Required: No; however, it is not valid to provide only one of `file_matcher_syntax` and `file_matcher_pattern`.
- Default: none

Example: with a `file_matcher_syntax` of `glob` and a `file_matcher_pattern` of `**.json`, all files (after prefix filtering) that end in `.json` are selected. A `**` in the glob-based pattern ensures that the pattern matches across directory boundaries; a `*` pattern does not match the directory character `/`.

Example: with a `file_matcher_syntax` of `regex`, a `file_matcher_pattern` of `auctions.*2021.json` uses a regular expression to ensure that files such as `auctions-12-01-2021.json` and `auctions-12-02-2021.json` are selected.

See getPathMatcher() (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/file/FileSystem.html#getPathMatcher(java.lang.String)) for complete syntax details of the `glob` and `regex` options.

Common matcher patterns:

| Matcher syntax | Pattern | Meaning |
| --- | --- | --- |
| glob | `*` | Matches zero or more characters without crossing directory boundaries |
| glob | `**` | Matches zero or more characters, crossing directory boundaries |
| glob | `?` | Matches exactly one character of a name component |
| glob | `[...]` | Matches any of the characters in the bracket expression (e.g., `[abc]`); supports ranges |
| regex | `.` | Matches any character |
| regex | `\d` | Matches any digit |
| regex | `\D` | Matches any non-digit |
| regex | `[...]` | Matches any of the characters in the bracket expression (e.g., `[abc]`); supports ranges |
| regex | `*` | Matches the preceding character zero or more times |
| regex | `+` | Matches the preceding character one or more times |
| regex | `?` | Matches the preceding character once or not at all |

#### `source.file_groups.<file_group_name>.rename_format`

A Java® MessageFormat (https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html) string. Captured groups from the `file_matcher_pattern` are placed into the rename format string's format elements in the order they were captured. The renamed file is used for subsequent sorting of files: when using lexicographic sort, the renamed file is used instead of the original filename; for `extract_timestamp` sorting, the `path_timestamp_pattern` is applied against the renamed file. This is best elucidated with an example:

- Filename: `20210102.json`
- `"file_matcher_pattern": "(\\d{4})(\\d{2})(\\d{2})"`
- `"rename_format": "file-{1}-{3}-{2}"`
- `"sort_type": "extract_timestamp"`
- `"path_timestamp_pattern": "'file-'yyyy'-'MM'-'dd"`

This renames the file to `file-2021-02-01`. The `path_timestamp_pattern` extracts an ISO 8601 datetime of `2021-02-01T00:00:00Z` from the renamed file.

Files are only renamed internally to the LAT to facilitate the file selection process; they are not actually renamed locally or on S3.

- Type: string
- Required: No
- Default: none. If `rename_format` is not provided, files do not undergo a rename step.
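The rename-then-extract behavior above can be sketched in Python. This is a hypothetical analogue (Python's `re` and `strptime` stand in for Java's regex, `MessageFormat`, and `DateTimeFormatter`), using the `20210102.json` example:

```python
import re
from datetime import datetime, timezone

# Matcher pattern with three capture groups (year, month, day).
# In pipeline JSON each backslash must be doubled: (\\d{4})(\\d{2})(\\d{2}).
pattern = re.compile(r"(\d{4})(\d{2})(\d{2})\.json")

def rename(filename, rename_format):
    """Fill {1}, {2}, ... in the rename format with capture groups, in order."""
    m = pattern.fullmatch(filename)
    groups = {str(i + 1): g for i, g in enumerate(m.groups())}
    return re.sub(r"\{(\d+)\}", lambda mo: groups[mo.group(1)], rename_format)

renamed = rename("20210102.json", "file-{1}-{3}-{2}")
# The timestamp pattern is then applied to the *renamed* file.
extracted = datetime.strptime(renamed, "file-%Y-%m-%d").replace(tzinfo=timezone.utc)
```

Here `renamed` is `file-2021-02-01` and `extracted` is `2021-02-01T00:00:00Z`, matching the worked example: group 1 (year) lands in `{1}`, group 3 (day) in `{3}`, group 2 (month) in `{2}`.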
#### `source.file_groups.<file_group_name>.sort_type`

Defines the sorting algorithm used to sort the files selected for loading in this file group by time. Files should be sorted in time order for the best loading performance and the creation of efficient Ocient segments. The sort type can be `extract_timestamp`, `metadata`, or `lexicographic`, and different settings apply to each of these selections.

- Type: string
- Required: Yes
- Default: none
- Allowed values:
  - `extract_timestamp` — extract the timestamp from the file name or file path information. Either `path_timestamp_pattern` or `file_timestamp_pattern` is required when choosing this sort type.
  - `metadata` — extract the timestamp from the source file's metadata. This option uses the file last-modified time when sorting the list.
  - `lexicographic` — sort files based on the alphanumeric sort of the files in the file group, using the full path and filename as the sorting key.

The system breaks ties among files in a lexicographic way by using their fully qualified original file names.

#### `source.file_groups.<file_group_name>.path_timestamp_pattern`

A format string with DateTimeFormatter (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/format/DateTimeFormatter.html) patterns, used to extract a datetime from the path portion of a file's fully qualified filename. This timestamp is used with the `extract_timestamp` sort type to order the selected files in the file group prior to loading. It is attempted on every file's fully qualified filename under the prefix; files that do not match the pattern are skipped, and the remaining subset is used for sorting.

- Type: string
- Required: One of `path_timestamp_pattern` or `file_timestamp_pattern` is required for an `extract_timestamp` sort type file group.
- Default: none

#### `source.file_groups.<file_group_name>.file_timestamp_pattern`

A format string with DateTimeFormatter (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/format/DateTimeFormatter.html) patterns, used to extract a datetime from the base filename of a file's fully qualified filename. This timestamp is used with the `extract_timestamp` sort type to order the selected files in the file group prior to loading. It is attempted on every file's base filename under the prefix; files that do not match the pattern are skipped, and the remaining subset is used for sorting.

- Type: string
- Required: One of `path_timestamp_pattern` or `file_timestamp_pattern` is required for an `extract_timestamp` sort type file group.
- Default: none

#### `source.file_groups.<file_group_name>.start_time`

An ISO 8601 compliant date or datetime, used as the lower bound to filter files in an `extract_timestamp` or `metadata` sort type file group. Datetimes are UTC unless a timezone is provided. Inclusive.

- Type: string
- Required: No
- Default: `Instant.MIN`

#### `source.file_groups.<file_group_name>.stop_time`

An ISO 8601 compliant date or datetime, used as the upper bound to filter files in an `extract_timestamp` or `metadata` sort type file group. Datetimes are UTC unless a timezone is provided. Exclusive.

- Type: string
- Required: No
- Default: `Instant.MAX`

#### Extract Timestamp Default Values

With an `extract_timestamp` sort type, if a certain time unit is not provided by the pattern, it defaults to its respective value in `1970-01-01T00:00:00.000000`.

Example: for the filename `July03.json` and a `path_timestamp_pattern` or `file_timestamp_pattern` of `MMMMdd'.json'`, the full timestamp extracted from this filename is `1970-07-03T00:00:00.000000` UTC.

##### Potential Pitfalls

Pitfall 1: these default values should be considered when a `start_time` or `stop_time` is provided but a certain time unit is not present within the filename.

Example: capture files ranging from July 1st through July 3rd.

- Filenames: `July01.json`, `July02.json`, `July03.json`
- `"file_timestamp_pattern": "MMMMdd'.json'"`
- `"start_time": "1970-07-01"`
- `"stop_time": "1970-07-04"`

Note that 1970 is used as the year in the `start_time` and `stop_time`, because the system could not extract a year from the filenames. An alternative strategy is to use the file renaming step to insert missing dates or times.

Pitfall 2 (only applicable when using the `extract_timestamp` sort type with `start_time` and `stop_time`): consider the time unit granularities between your pattern and the start/stop times.

Example:

- Filenames: `dir1/2021/01/01/00-00-00.json`, `dir1/2021/01/01/01-00-00.json`, `dir1/2021/01/01/05-00-00.json`
- `"path_timestamp_pattern": "'dir1/'yyyy'/'MM'/'dd'/'"`
- `"start_time": "2021-01-01T00:00:00"`
- `"stop_time": "2021-01-01T04:00:00"`

Given this configuration, one might expect that the files at hour 4 and later would be filtered out. However, because the pattern does not extract the times from the filenames, each file's extracted datetime in ISO 8601 format is `2021-01-01T00:00:00Z`, so no files are outside the range and filtered. Make sure to extract as much information about the datetime as necessary.

#### `source.file_groups.<file_group_name>.start_file`

A string used as the lower bound to filter files in a `lexicographic` sort type file group. Inclusive.

- Type: string
- Required: No
- Default: none; a `lexicographic` sort type file group does not check for a lower bound if this is not present.

#### `source.file_groups.<file_group_name>.stop_file`

A string used as the upper bound to filter files in a `lexicographic` sort type file group. Inclusive.

- Type: string
- Required: No
- Default: none; a `lexicographic` sort type file group does not check for an upper bound if this is not present.

#### `source.file_groups.<file_group_name>.compression`

The compression method for the files in this file group. If this value is set, it overrides `source.compression`; if not set, it inherits `source.compression`. See `source.compression` for available options.

- Type: string
- Required: No
- Default: null

#### `source.file_groups.<file_group_name>.bucket`

This setting is available only for `s3` sources. If specified, this setting overrides the source-wide bucket for this file group, which allows loading from multiple buckets at once within the same pipeline. If a bucket is specified for every file group, you can omit `source.bucket`.

- Type: string
- Required: No
- Default: null
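The extract-timestamp default behavior described above can be demonstrated in Python. Note one assumed difference: Python's `strptime` defaults a missing year to 1900, whereas the LAT uses `1970-01-01T00:00:00`, so this sketch applies the 1970 default explicitly:

```python
from datetime import datetime

# The LAT fills time units missing from the pattern with values from
# 1970-01-01T00:00:00. Python's strptime would default the year to 1900,
# so we apply the LAT-style epoch default ourselves.
EPOCH = datetime(1970, 1, 1)

def extract(filename, fmt):
    parsed = datetime.strptime(filename, fmt)
    # The pattern "%B%d.json" (analogous to MMMMdd'.json') yields only a
    # month and day; the year comes from the epoch default.
    return parsed.replace(year=EPOCH.year)

files = ["July01.json", "July02.json", "July03.json"]
stamps = [extract(f, "%B%d.json") for f in files]

# As in Pitfall 1, start/stop bounds must also use the 1970 default year,
# or no file would fall inside the range (start inclusive, stop exclusive).
start, stop = datetime(1970, 7, 1), datetime(1970, 7, 4)
kept = [f for f, ts in zip(files, stamps) if start <= ts < stop]
```

With the 1970 bounds all three July files are kept; with, say, 2021 bounds the same filenames would all be filtered out, which is exactly the trap Pitfall 1 warns about.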
#### `source.partitions`

The number of partitions for each file group. For example, if there are 2 file groups and `partitions` is set to 32, then a total of 64 partitions exist, with 32 for each file group. The partition index is 0-based for each file group; in this example, each file group has partitions `[0, 31]`. When `partitions` is specified for a load, its value must not change for the lifetime of the load; otherwise, records cannot load correctly. When using the LAT client, `partitions` is automatically assigned by the client; if it is set in the source configuration provided by the user, it is overwritten.

- Type: int
- Required: No
- Default: set by the LAT client

#### `source.partitions_assigned`

A 2-element array indicating the minimum and maximum partition indices that this LAT node owns, inclusive on both ends. For example, `[0, 15]` means that the node owns partitions 0, 1, 2, …, 15. The first element should be greater than or equal to 0. The second element should be greater than or equal to the first element, and less than the number of partitions. When using the LAT client, `partitions_assigned` is automatically assigned by the client; if it is set in the source configuration provided by the user, it is overwritten.

- Type: int array
- Required: No
- Default: set by the LAT client

#### `source.compression`

The compression method for the files. Currently, only `none` and `gzip` are supported; `none` means that the files are not compressed. Note that this value sets the default compression method for all the files from this source, and it can be overridden per file group by `source.file_groups.<file_group_name>.compression`.

- Type: string
- Required: No
- Default: `none`

#### `source.chunk_size`

The size of a chunk when fetching data, in MB. For example, if you have a 40 MB file and the chunk size is 16 MB, the LAT can issue 3 sequential requests to get the file. The value must be greater than 0. The chunk size should be larger than or equal to the maximum record size.

- Type: int
- Required: No
- Default: 16

#### `source.buffer_size`

The total buffer size for all assigned partitions, in MB. This value is divided by the total number of assigned partitions to calculate each partition's buffer size. For example, if this value is 4096, `source.partitions` is 16, `source.partitions_assigned` is `[0, 3]`, and you have 2 file groups, then you have a total of (3 − 0 + 1) × 2 = 8 assigned partitions, and each partition gets 4096 / 8 = 512 MB as its buffer size. The value must be greater than 0. The buffer size per partition must be at least twice the value of `source.chunk_size`.

- Type: int
- Required: No
- Default: 4096

#### `source.max_fetch_concurrency`

The maximum concurrency when fetching files for each partition. Setting this configuration too high can result in thread contention. The value must be greater than 0.

- Type: int
- Required: No
- Default: 2
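The per-partition buffer arithmetic above can be written out as a small helper. This is an illustrative sketch of the documented formula, not LAT code:

```python
def buffer_per_partition(buffer_size_mb, partitions_assigned, num_file_groups):
    """Per-partition buffer: the total buffer divided by the number of
    assigned partitions across all file groups. The assigned range is
    inclusive on both ends."""
    lo, hi = partitions_assigned
    assigned = (hi - lo + 1) * num_file_groups
    return buffer_size_mb // assigned

# The documented example: 4096 MB total, partitions_assigned [0, 3],
# 2 file groups -> 8 assigned partitions, 512 MB each.
per_part = buffer_per_partition(4096, (0, 3), 2)

# Sanity check against the chunk-size rule: per-partition buffer must be
# at least twice the chunk size (default chunk_size is 16 MB).
ok = per_part >= 2 * 16
```

This makes the sizing constraint easy to check before deploying a pipeline: if `buffer_size` is too small relative to the assigned partition count, the per-partition share can drop below twice `chunk_size`.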
### S3 Source-Specific Configuration

#### `source.endpoint`

The endpoint for the S3 instance, usually starting with `http` or `https`. It can be an IP address or a domain. For example, for AWS S3 in the us-east-2 region, the endpoint is https://s3.us-east-2.amazonaws.com. The S3 source supports virtual-hosted-style access and path-style access; `s3://` protocol-based access is not supported. For more details, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html.

- Type: string
- Required: Yes
- Default: none

#### `source.region`

The region for the S3 instance. This is mostly used for AWS S3 instances.

- Type: string
- Required: No
- Default: `us-east-2`

#### `source.bucket`

The bucket from which to get S3 objects. You can override the bucket on a per-file-group basis by setting `source.file_groups.<file_group_name>.bucket`. This setting is required unless a bucket is specified for every file group.

- Type: string
- Required: No
- Default: none

#### `source.path_style_access`

Whether to force path-style access for the S3 instance. If set to `false`, the LAT tries to use the virtual-hosted style (using DNS subdomains) and falls back to path-style access. AWS S3 is deprecating path-style access; see https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/.

- Type: boolean
- Required: No
- Default: `false`

#### `source.access_key_id`

The access key for the S3 instance. This should be used together with the `source.secret_access_key` setting. If either of them is absent, the LAT defaults to the next item in the credentials hierarchy; see S3 Credentials Hierarchy.

- Type: string
- Required: No
- Default: null

#### `source.secret_access_key`

The access secret for the S3 instance. This should be used together with the `source.access_key_id` setting. If either of them is absent, the LAT defaults to the next item in the credentials hierarchy; see S3 Credentials Hierarchy.

- Type: string
- Required: No
- Default: null

#### `source.session_token`

Temporary credentials can be made with a combination of the existing secret ID and secret key, plus an additional session token. Note that this is only used when both `source.access_key_id` and `source.secret_access_key` are specified.

- Type: string
- Required: No
- Default: null

#### `source.retries`

Configures the number of retries that are attempted when a download from the S3 source fails.

- Type: int
- Required: No
- Default: the default attempts specified in the AWS SDK

#### `source.backoff_strategy.base_delay_seconds`

Configures the base delay for the backoff strategy of the S3 client, in seconds.

- Type: int
- Required: No
- Default: 1

#### `source.backoff_strategy.max_backoff_seconds`

Configures the maximum backoff delay for the backoff strategy of the S3 client, in seconds.

- Type: int
- Required: No
- Default: for the default delay, see https://github.com/aws/aws-sdk-java-v2/blob/master/core/sdk-core/src/main/java/software/amazon/awssdk/core/internal/retry/SdkDefaultRetrySetting.java#L66

#### `source.max_pending_connection_acquires`

Configures the maximum number of pending acquires allowed by the Netty client.

- Type: int
- Required: No
- Default: for the default number of acquires, see https://github.com/aws/aws-sdk-java-v2/blob/master/http-client-spi/src/main/java/software/amazon/awssdk/http/SdkHttpConfigurationOption.java#L146

#### `source.netty_read_timeout_seconds`

Configures the read timeout, in seconds, of the Netty client. When you set this value to zero, the system disables the read timeout. The LAT configures read timeouts to be retried by the S3 client.

- Type: int
- Required: No
- Default: for the default timeout value, see https://github.com/aws/aws-sdk-java-v2/blob/master/http-client-spi/src/main/java/software/amazon/awssdk/http/SdkHttpConfigurationOption.java#L136

#### `source.connection_timeout_seconds`

Configures the amount of time, in seconds, for the Netty client to wait when initially establishing a connection before giving up and timing out. The LAT configures connection timeouts to be retryable by the S3 client.

- Type: int
- Required: No
- Default: for the default timeout value, see https://github.com/aws/aws-sdk-java-v2/blob/master/http-client-spi/src/main/java/software/amazon/awssdk/http/SdkHttpConfigurationOption.java#L134

#### `source.connection_acquisition_timeout_seconds`

Configures the amount of time, in seconds, for the Netty client to wait when acquiring a connection from the pool before giving up and timing out. The LAT configures connection acquisition timeouts to be retryable by the S3 client.

- Type: int
- Required: No
- Default: for the default timeout value, see https://github.com/aws/aws-sdk-java-v2/blob/master/http-client-spi/src/main/java/software/amazon/awssdk/http/SdkHttpConfigurationOption.java#L137

#### `source.requester_pays`

Configures whether the requester (that is, the LAT user) should be charged for downloading data from S3 Requester Pays buckets. This configuration should be set to `true` whenever you request from a Requester Pays bucket; otherwise the request fails and the bucket owner is charged for the request.

- Type: boolean
- Required: No
- Default: `false`

### S3 Credentials Hierarchy

The S3 source configuration supports the following hierarchy to obtain S3 credentials. If the LAT does not obtain the credentials at a level, the LAT tries the next lower level.

- Level 1: pipeline configuration
- Level 2: the AWS SDK default credential chain (https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html)
- Level 3 (default): anonymous access

You can choose the level at which to store the credentials, where Level 1 is the highest; higher levels take precedence for credential storage.
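The base-delay and max-backoff settings above interact in the usual capped-exponential pattern. The following sketch is a simplified model (the real AWS SDK also applies jitter, and its defaults live in the SDK source linked above; the `max_backoff_s=20` default here is only an illustrative assumption):

```python
def backoff_delay(attempt, base_delay_s=1, max_backoff_s=20):
    """Capped exponential backoff: base * 2^attempt, never exceeding the
    configured maximum. Jitter is omitted for clarity."""
    return min(base_delay_s * (2 ** attempt), max_backoff_s)

# Delays for the first six retry attempts with the defaults above.
delays = [backoff_delay(n) for n in range(6)]  # 1, 2, 4, 8, 16, 20
```

Raising `base_delay_seconds` shifts the whole curve up, while `max_backoff_seconds` only clips its tail; that is why a low cap dominates behavior once retries accumulate.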
## Load from a Kafka Source

The Kafka source allows the LAT to connect to a Kafka cluster and process records from one or more topics. You can reference each Kafka topic in the docid\ b7shicmwe2h7o1xfxjny and can load into one or more tables. In addition to the LAT-specific configuration, this source supports configuration passthrough to the underlying Kafka consumer (https://kafka.apache.org/28/documentation.html#consumerconfigs) that this source uses. Some library configurations are not allowed; these are noted below.

### Example Kafka Source Configuration

```json
{
  "source": {
    "type": "kafka",
    "kafka": {
      "bootstrap.servers": "127.0.0.1:9092",
      "auto.offset.reset": "earliest"
    }
  }
}
```

### Kafka Configuration

#### `source.end_offsets_polling_duration`

The frequency with which to poll for end offsets for lag calculation, in milliseconds.

- Type: int
- Required: No
- Default: 30000

#### `source.kafka`

The Kafka consumer configuration object. See https://kafka.apache.org/28/documentation.html#consumerconfigs for details.

- Type: object
- Required: Yes

Required keys:

- `kafka.bootstrap.servers` — this key is required for connection information to the Kafka servers.

Common keys:

- `kafka.auto.offset.reset` — this key is commonly set to `earliest` to consume all of the data in an existing Kafka topic.

#### Disallowed Kafka Configuration

The following Kafka consumer configuration options are not allowed:

- `kafka.enable.auto.commit` — the LAT always internally sets this configuration to `false`; it is critical for correct operation to do so.
- `kafka.key.deserializer` — the LAT always internally configures deserialization.
- `kafka.value.deserializer` — the LAT always internally configures deserialization.

If any of the disallowed Kafka library configurations are set, a warning is logged and those configurations are ignored.

#### Defaulted Kafka Configuration

- `kafka.group.id` — if unset, the LAT configures the group ID internally using the value `ocient_lat_[$pipeline_id]`. Most users should leave this unset unless they need to explicitly control the group ID that is used.

### Related Links

docid\ tt6tfoulap0mt aycm2ka
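The disallowed-key and defaulted-key behavior described above can be sketched as a small validation helper. This is hypothetical code, not the LAT implementation, and the exact internal group-ID format is an assumption based on the description above:

```python
# Consumer options the LAT manages internally; user-supplied values are
# ignored (the LAT logs a warning for them).
DISALLOWED = {"enable.auto.commit", "key.deserializer", "value.deserializer"}

def effective_consumer_config(user_kafka_config, pipeline_id):
    """Model the consumer config the LAT would hand to the Kafka library:
    disallowed keys are dropped, auto-commit is forced off, and group.id
    defaults to an internal value when unset."""
    config = {k: v for k, v in user_kafka_config.items() if k not in DISALLOWED}
    config["enable.auto.commit"] = "false"  # always forced off by the LAT
    # Assumed format of the internally generated group ID.
    config.setdefault("group.id", f"ocient_lat_[{pipeline_id}]")
    return config

cfg = effective_consumer_config(
    {
        "bootstrap.servers": "127.0.0.1:9092",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": "true",  # disallowed: ignored with a warning
    },
    pipeline_id="my_pipeline",
)
```

The passthrough keys (`bootstrap.servers`, `auto.offset.reset`) survive untouched, while the disallowed auto-commit setting is overridden and a group ID is filled in.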