
Configuration

Persistor is deployed as a set of Kubernetes resources, each of which is highly configurable.

The tables below list the variables available for configuring each Persistor component.

Persistor Core

Below are the variables used to configure the main Persistor component. The broker-specific configuration options should be taken into consideration along with the “Common” variables.

Common Configuration

Below is the shared configuration used between all Persistor types.

Variable	Example Value	Possible Values	Description	Required
READER_TYPE	“pubsub”	[“pubsub”, “kafka”, “servicebus”]	Type of broker used for reader	yes
SENDER_TYPE	“pubsub”	[“pubsub”, “kafka”, “servicebus”]	Type of broker used for sender	no (yes if reader is kafka or if indexer is enabled)
STORAGE_TYPE	“gcs”	[“azure”, “gcs”]	Type of storage used	yes
STORAGE_DESTINATION	“my-bucket”	N/A	Name of GCS bucket or ABS container	yes
STORAGE_WRITETIMEOUT	“20s”	N/A	Timeout of the write operation (the message is nacked or sent to dead letter if exceeded)	no (defaults to 20s)
STORAGE_MASK	“year/month/day/hour/”	N/A	Structure of the path under which batches are stored	yes
STORAGE_CUSTOMVALUES	“org:company,dept:sales”	Comma-separated list of pairs of the form key:value	Values to include in file path. If it contains key1:value1 and STORAGE_MASK has key1, then path will contain value1.	no
STORAGE_VERSIONKEY	“my-bucket”	N/A	Key in metadata that messages get grouped by.	no
BATCHSETTINGS_BATCHSIZE	“100”	N/A	Maximum number of messages in a batch.	no (defaults to 10000)
BATCHSETTINGS_BATCHMEMORY	“1000000”	N/A	Maximum bytes in batch.	no (defaults to 1000000)
BATCHSETTINGS_BATCHTIMEOUT	“60s”	N/A	Time to wait before writing batch if it is not full.	no (defaults to 30s)
STORAGE_TOPICID	“my_topic”	N/A	Topic’s name	yes
STORAGE_PREFIX	“my-bucket”	N/A	Prefix to be given to all files stored to the chosen target blob storage.	yes (but can be “”)
STORAGE_EXTENSION	“avro”	N/A	Extension of the files stored to blob storage.	yes
DEADLETTERENABLED	“true”	[“true”, “false”]	Whether messages will be sent to dead letter upon error	no (must be “true” if reader is kafka; otherwise defaults to “false”)
SENDER_DEADLETTERTOPIC	“persistor-dltopic”	N/A	Dead letter topic name	no (yes if reader is kafka)
MINIMUM_LOG_LEVEL	“WARN”	[“WARN”, “ERROR”, “INFO”]	The minimal level used in logs.	Default: “INFO”
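
The conditional entries in the Required column are easy to miss, so the following is a minimal sketch of a pre-deployment check written in Python. The function name `validate_persistor_env` and the wording of the messages are hypothetical and are not part of Persistor; the rules are only those stated in the tables of this page, and STORAGE_MASK is left out because it is listed both as required and as having a default.

```python
BROKERS = {"pubsub", "kafka", "servicebus"}
STORAGES = {"azure", "gcs"}


def validate_persistor_env(env):
    """Return a list of problems; an empty list means no documented rule was violated.

    Hypothetical helper: `env` is a plain dict of variable name to string value.
    """
    problems = []

    # Variables marked "yes" in the Required column of the common table.
    for name in ("READER_TYPE", "STORAGE_TYPE", "STORAGE_DESTINATION",
                 "STORAGE_TOPICID", "STORAGE_EXTENSION"):
        if not env.get(name):
            problems.append(f"{name} is required")

    # STORAGE_PREFIX must be present, but an empty string is allowed.
    if "STORAGE_PREFIX" not in env:
        problems.append("STORAGE_PREFIX is required (it may be empty)")

    if env.get("READER_TYPE") and env["READER_TYPE"] not in BROKERS:
        problems.append("READER_TYPE must be one of " + ", ".join(sorted(BROKERS)))
    if env.get("STORAGE_TYPE") and env["STORAGE_TYPE"] not in STORAGES:
        problems.append("STORAGE_TYPE must be one of " + ", ".join(sorted(STORAGES)))

    kafka_reader = env.get("READER_TYPE") == "kafka"
    indexer_on = env.get("INDEXERENABLED", "false") == "true"

    # Conditional requirements taken from the Required column.
    if (kafka_reader or indexer_on) and not env.get("SENDER_TYPE"):
        problems.append("SENDER_TYPE is required when the reader is kafka "
                        "or the Indexer is enabled")
    if indexer_on and not env.get("SENDER_TOPICID"):
        problems.append("SENDER_TOPICID is required when the Indexer is enabled")
    if kafka_reader:
        if env.get("DEADLETTERENABLED", "false") != "true":
            problems.append('DEADLETTERENABLED must be "true" when the reader is kafka')
        if not env.get("SENDER_DEADLETTERTOPIC"):
            problems.append("SENDER_DEADLETTERTOPIC is required when the reader is kafka")

    return problems
```

Passing `dict(os.environ)` to the function would check the current process environment; a check like this could run in CI before the Kubernetes manifests are applied.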

Enabling the Indexer Plugin

Below are the variables to be used when deploying the Persistor alongside the Indexer plugin.

Variable	Example Value	Possible Values	Description	Required
INDEXERENABLED	“true”	[“true”, “false”]	Whether to send messages to Indexer topic or not.	Default: “false”
SENDER_TOPICID	“persistor-indexertopic”	N/A	ID of the topic used for communication between Persistor and Indexer.	no (yes if indexer is enabled)

Additional Configuration

Below are more advanced configuration options available as part of the Persistor’s configuration suite.

Variable	Example Value	Possible Values	Description	Required
STORAGE_MASK	""	any subset of (year, month, day, hour, version) in any order separated by /	Path within a blob storage where messages will be saved.	Default: “year/month/day/hour/”
OBJECT_VERSION	""	N/A	Key in a message’s metadata whose value represents the message version.	Default: ""
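
To illustrate how STORAGE_MASK and STORAGE_CUSTOMVALUES combine into a storage path, here is a small Python sketch. It is not Persistor's implementation: the helper names, the zero-padding of the time components, and the error raised for an unknown mask key are all assumptions made for the example.

```python
from datetime import datetime


def parse_custom_values(raw):
    """Turn "org:company,dept:sales" into {"org": "company", "dept": "sales"}."""
    pairs = {}
    for item in filter(None, raw.split(",")):
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def resolve_mask(mask, when, version="", custom=None):
    """Replace each mask key with its value and join the results with '/'.

    Assumption: time components are zero-padded; the real layout may differ.
    """
    custom = custom or {}
    time_parts = {
        "year": f"{when.year:04d}",
        "month": f"{when.month:02d}",
        "day": f"{when.day:02d}",
        "hour": f"{when.hour:02d}",
    }
    segments = []
    for key in filter(None, mask.split("/")):
        if key in time_parts:
            segments.append(time_parts[key])
        elif key == "version":
            segments.append(version)
        elif key in custom:
            # A key from STORAGE_CUSTOMVALUES contributes its value to the path.
            segments.append(custom[key])
        else:
            raise ValueError(f"unknown mask key: {key}")
    return "/".join(segments)
```

Under these assumptions, the default mask “year/month/day/hour/” applied to a batch written at 09:00 on 7 March 2024 resolves to `2024/03/07/09`, and the mask “org/year” with the custom values “org:company,dept:sales” resolves to `company/2024`.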

GCP Pub/Sub

GCP Configuration

Below are the variables relevant when Google PubSub is the broker that messages are pulled from and/or Google Cloud Storage is the destination storage.

PubSub used

Variable	Example Value	Possible Values	Description	Required
READER_PUBSUB_PROJECTID	“my-gcp-project”	N/A	ID of the GCP project you’re working on	yes
SENDER_PUBSUB_PROJECTID	“my-gcp-project”	N/A	ID of the GCP project you’re working on	no (yes if indexer or dead letter are enabled)
READER_PUBSUB_SUBID	“persistor-sub”	N/A	Pubsub subscription that messages come from	yes

Azure

Azure Configuration

Below are the variables relevant when Azure Service Bus is the broker that messages are pulled from and/or Azure Blob Storage is the destination storage.

Common Configuration

Variable	Example Value	Possible Values	Description	Required
AZURE_CLIENT_ID	“19b725a4-1a39-5fa6-bdd0-7fe992bcf33c”	N/A	Client ID of your Service Principal	yes
AZURE_TENANT_ID	“38c345b5-1b40-7fb6-acc0-5ab776daf44e”	N/A	Tenant ID of your Service Principal	yes
AZURE_CLIENT_SECRET	“49d537a6-8a49-5ac7-ffe1-6fe225abe33f”	N/A	Client secret of your Service Principal	yes

Azure Service Bus used

Variable	Example Value	Possible Values	Description	Required
READER_SERVICEBUS_CONNECTIONSTRING	“Endpoint=sb://…”	N/A	Connection string for the service bus namespace	yes
READER_SERVICEBUS_TOPICID	“persistor-topic”	N/A	Service bus topic name	yes
READER_SERVICEBUS_SUBID	“persistor-subscription”	N/A	Service bus subscription name	yes
SENDER_SERVICEBUS_CONNECTIONSTRING	“Endpoint=sb://…”	N/A	Connection string for the sender service bus namespace	no (yes if indexer or dead letter is enabled)

Azure Blob Storage used

Variable	Example Value	Possible Values	Description	Required
STORAGE_STORAGEACCOUNTID	“persistor-storage”	N/A	ID of the storage account	yes

Kafka

Kafka Configuration

Below are the variables relevant when Apache Kafka is the broker that messages are pulled from. They should be used together with either GCS or Azure Blob Storage as the destination storage.

Kafka used

Variable	Example Value	Possible Values	Description	Required
READER_KAFKA_ADDRESS		N/A	Address of the kafka broker	yes
READER_KAFKA_GROUPID	“persistor”	N/A	Consumer group’s name.	yes
READER_KAFKA_TOPICID		N/A	Kafka source topic name	no
SENDER_KAFKA_ADDRESS		N/A	Address of the kafka broker for sender	no

Indexer

Below are the variables used to configure the Indexer, the component that pulls the Indexer metadata generated by the Persistor from the dedicated broker and stores it in the NoSQL (Mongo) database for resubmission purposes.

An Indexer “type” is determined based on the message broker used as the communication channel to receive the required metadata.

Common Configuration

Below is the shared configuration between all Indexer types.

Variable	Example Value	Possible Values	Description	Required
READER_TYPE	“pubsub”	[“pubsub”, “kafka”, “servicebus”]	Type of broker used	yes
SENDER_TYPE	“pubsub”	[“pubsub”, “kafka”, “servicebus”]	Type of broker used for sender	no (yes if reader is kafka or if indexer is enabled)
BATCHSETTINGS_BATCHSIZE	“10”	N/A	Maximum number of messages in a batch.	no (defaults to 10000)
BATCHSETTINGS_BATCHMEMORY	“1000000”	N/A	Maximum bytes in batch.	no (defaults to 1000000)
BATCHSETTINGS_BATCHTIMEOUT	“60s”	N/A	Time to wait before writing batch if it is not full.	no (defaults to 30s)
DEADLETTERENABLED	“true”	[“true”, “false”]	Whether messages will be sent to dead letter upon error	no (must be “true” if reader is kafka; otherwise defaults to “false”)
SENDER_DEADLETTERTOPIC	“persistor-dltopic”	N/A	Dead letter topic name	no (yes if reader is kafka)
MONGO_CONNECTIONSTRING	“mongodb://mongo-0.mongo-service.dataphos:27017”	N/A	MongoDB connection string	yes
MONGO_DATABASE	“indexer_db”	N/A	Mongo database name to store metadata in	yes
MONGO_COLLECTION	“indexer_collection”	N/A	Mongo collection name (will be created if it doesn’t exist)	yes
MINIMUM_LOG_LEVEL	“WARN”	[“WARN”, “ERROR”, “INFO”]	The minimal level used in logs.	Default: “INFO”
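
The three BATCHSETTINGS variables bound a batch by message count, by size in bytes, and by waiting time. The Python sketch below shows one way such limits can interact. The `Batch` class is hypothetical, and two details are assumptions rather than documented behaviour: that the timeout is measured from the first message in the batch, and that a limit is considered hit once it is reached rather than exceeded.

```python
class Batch:
    """Hypothetical in-memory batch governed by three limits."""

    def __init__(self, max_size=10000, max_bytes=1_000_000, timeout_s=30.0):
        self.max_size = max_size      # BATCHSETTINGS_BATCHSIZE
        self.max_bytes = max_bytes    # BATCHSETTINGS_BATCHMEMORY
        self.timeout_s = timeout_s    # BATCHSETTINGS_BATCHTIMEOUT, in seconds
        self.messages = []
        self.size_bytes = 0
        self.started_at = None

    def add(self, payload, now):
        """Append a message; `now` is a timestamp in seconds supplied by the caller."""
        if not self.messages:
            # Assumption: the timeout clock starts with the first message.
            self.started_at = now
        self.messages.append(payload)
        self.size_bytes += len(payload)

    def should_flush(self, now):
        """True when any of the three limits has been reached."""
        if not self.messages:
            return False
        return (len(self.messages) >= self.max_size
                or self.size_bytes >= self.max_bytes
                or now - self.started_at >= self.timeout_s)

    def flush(self):
        """Return the buffered messages and reset the batch."""
        out = self.messages
        self.messages, self.size_bytes, self.started_at = [], 0, None
        return out
```

Passing the clock value explicitly keeps the sketch deterministic; a real consumer loop would typically pass `time.monotonic()`.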

Advanced Configuration

GCP

GCP Configuration

Below are the configuration options if Google PubSub is used as the communication channel between the components.

Variable	Example Value	Possible Values	Description	Required
READER_PUBSUB_PROJECTID	“my-gcp-project”	N/A	ID of the GCP project you’re working on	yes
SENDER_PUBSUB_PROJECTID	“my-gcp-project”	N/A	ID of the GCP project you’re working on	no (yes if dead letter is enabled)
READER_PUBSUB_SUBID	“indexer-sub”	N/A	Pubsub subscription that messages come from	yes

Azure

Azure Service Bus Configuration

Below are the configuration options if Azure Service Bus is used as the communication channel between the components.

Variable	Example Value	Possible Values	Description	Required
READER_SERVICEBUS_CONNECTIONSTRING	“Endpoint=sb://…”	N/A	Connection string for the service bus namespace	yes
READER_SERVICEBUS_TOPICID	“persistor-indexer-topic”	N/A	Service bus topic name	yes
READER_SERVICEBUS_SUBID	“indexer-subscription”	N/A	Service bus subscription name	yes
SENDER_SERVICEBUS_CONNECTIONSTRING	“Endpoint=sb://…”	N/A	Connection string for the sender service bus namespace	no (yes if indexer or dead letter is enabled)

Kafka

Kafka Configuration

Below are the configuration options if Apache Kafka is used as the communication channel between the components.

Variable	Example Value	Possible Values	Description	Required
READER_KAFKA_ADDRESS		N/A	Address of the kafka broker	yes
READER_KAFKA_GROUPID	“indexer”	N/A	Consumer group’s name.	yes
READER_KAFKA_TOPICID		N/A	Kafka source topic name	no
SENDER_KAFKA_ADDRESS		N/A	Address of the kafka broker for sender	no

Indexer API

The Indexer API is created on top of the initialized Mongo database and used to query the Indexer metadata.

Simple Configuration

Below are the minimum configuration options required for the Indexer API to work.

Variable	Example Value	Possible Values	Description	Required
CONN	“mongodb://mongo-0.mongo-service:27017”	N/A	MongoDB connection string.	yes
DATABASE	“indexer_db”	N/A	MongoDB database from which Indexer will read messages	yes

Advanced Configuration

Below are additional configuration options offered by the Indexer API.

Variable	Example Value	Possible Values	Description	Required
MINIMUM_LOG_LEVEL	“WARN”	[“WARN”, “ERROR”, “INFO”]	The minimal level used in logs.	Default: “INFO”
SERVER_ADDRESS	“:8080”	N/A	Port on which Indexer API will listen for traffic	Default: “:8080”
USE_TLS	“false”	[“true”, “false”]	Whether to use TLS or not	Default: “false”
SERVER_TIMEOUT	“2s”	N/A	The amount of time allowed to read request headers	Default: “2s”

Resubmitter API

The Resubmitter API is connected to the Indexer API for efficient fetching of data. Its configuration depends on the type of storage it is meant to query and on the destination broker.

Common Configuration

Below are the common configuration options for the Resubmitter API.

Variable	Example Value	Possible Values	Description	Required
INDEXER_URL	“http://34.77.44.130:8080/”	N/A	The URL of the Indexer API with which the Resubmitter will communicate	yes
STORAGE_TYPE	“gcs”	[“gcs”, “abs”]	Type of storage used by Persistor	yes
PUBLISHER_TYPE	“pubsub”	[“servicebus”, “kafka”, “pubsub”]	Type of broker used	yes
SERVER_ADDRESS	“:8081”	N/A	Port on which Resubmitter will listen for traffic	Default: “:8081”

Advanced Configuration

Below are the additional configuration options offered by the Resubmitter API.

Variable	Example Value	Possible Values	Description	Required
MINIMUM_LOG_LEVEL	“WARN”	[“WARN”, “ERROR”, “INFO”]	The minimal level used in logs.	Default: “INFO”
RSB_META_CAPACITY	“20000”	N/A	Maximum number of messages which Indexer will return from MongoDB at once	Default: “10000”
RSB_FETCH_CAPACITY	“200”	N/A	Maximum number of workers in Resubmitter that are used for fetching from storage	Default: “100”
RSB_WORKER_NUM	“3”	N/A	Number of workers in Resubmitter that are used for packaging records	Default: “2”
RSB_ENABLE_MESSAGE_ORDERING	“false”	[“true”, “false”]	Whether to publish messages using ordering keys	Default: “false”
USE_TLS	“false”	[“true”, “false”]	Whether to use TLS or not	Default: “false”
SERVER_TIMEOUT	“2s”	N/A	The amount of time allowed to read request headers	Default: “2s”
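
As a quick way to see which values a Resubmitter deployment would end up with, the Python sketch below overlays the supplied variables on the defaults listed in the two tables above. The function name `effective_settings` is hypothetical, and values are kept as strings rather than parsed, since this page does not specify how the component parses them.

```python
# Defaults copied from the Resubmitter tables above.
RESUBMITTER_DEFAULTS = {
    "SERVER_ADDRESS": ":8081",
    "MINIMUM_LOG_LEVEL": "INFO",
    "RSB_META_CAPACITY": "10000",
    "RSB_FETCH_CAPACITY": "100",
    "RSB_WORKER_NUM": "2",
    "RSB_ENABLE_MESSAGE_ORDERING": "false",
    "USE_TLS": "false",
    "SERVER_TIMEOUT": "2s",
}

# Variables marked "yes" in the Required column.
REQUIRED = ("INDEXER_URL", "STORAGE_TYPE", "PUBLISHER_TYPE")


def effective_settings(env):
    """Return defaults overlaid with `env`; raise if a required variable is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    settings = dict(RESUBMITTER_DEFAULTS)
    for name, value in env.items():
        # Ignore variables that are not described in the tables above.
        if name in RESUBMITTER_DEFAULTS or name in REQUIRED:
            settings[name] = value
    return settings
```

Only the variables a deployment actually sets need to be supplied; everything else falls back to the documented default.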

GCP

GCP Configuration

Below are the options to be configured if Google PubSub is used as the destination broker for resubmission and/or Google Cloud Storage is the data source used for the resubmission.

Common Configuration

Variable	Example Value	Possible Values	Description	Required
PUBSUB_PROJECT_ID	“my-gcp-project”	N/A	ID of the GCP project	yes

PubSub as Target Broker

Variable	Example Value	Possible Values	Description	Required
PUBLISH_TIMEOUT	“15s”	N/A	The maximum time that the client will attempt to publish a bundle of messages.	Default: “15s”
PUBLISH_BYTE_THRESHOLD	“50”	N/A	Publish a batch when its size in bytes reaches this value.	Default: “50”
PUBLISH_COUNT_THRESHOLD	“50”	N/A	Publish a batch when it has this many messages.	Default: “50”
PUBLISH_DELAY_THRESHOLD	“50ms”	N/A	Publish a non-empty batch after this delay has passed.	Default: “50ms”
NUM_PUBLISH_GOROUTINES	“52428800”	N/A	The number of goroutines used in each of the data structures that are involved along the Publish path.	Default: “52428800”
MAX_PUBLISH_OUTSTANDING_MESSAGES	“800”	N/A	MaxOutstandingMessages is the maximum number of buffered messages to be published.	Default: “800”
MAX_PUBLISH_OUTSTANDING_BYTES	“1048576000”	N/A	MaxOutstandingBytes is the maximum size of buffered messages to be published.	Default: “1048576000”
PUBLISH_ENABLE_MESSAGE_ORDERING	“false”	[“true”, “false”]	Whether to publish messages using ordering keys	Default: “false”

Azure

Azure Configuration

Below are the options to be configured if Azure Service Bus is used as the destination broker for resubmission and/or Azure Blob Storage is the data source used for the resubmission.

Common Configuration

Variable	Example Value	Possible Values	Description	Required
AZURE_CLIENT_ID	“19b725a4-1a39-5fa6-bdd0-7fe992bcf33c”	N/A	Client ID of your Service Principal	yes
AZURE_TENANT_ID	“38c345b5-1b40-7fb6-acc0-5ab776daf44e”	N/A	Tenant ID of your Service Principal	yes
AZURE_CLIENT_SECRET	“49d537a6-8a49-5ac7-ffe1-6fe225abe33f”	N/A	Client secret of your Service Principal	yes

Service Bus as Target Broker

Variable	Example Value	Possible Values	Description	Required
SB_CONNECTION_STRING	“Endpoint=sb://foo.servicebus.windows.net/;SharedAccessKeyName=Roo”	N/A	Connection string for Azure Service Bus	yes

Azure Blob Storage used as Resubmission Source

Variable	Example Value	Possible Values	Description	Required
AZURE_STORAGE_ACCOUNT_NAME	mystorageaccount	N/A	Name of the Azure Storage Account.	yes

Kafka

Kafka Configuration

Below are the options to be configured if Apache Kafka is used as the destination broker for resubmission.

Kafka as Target Broker

Variable	Example Value	Possible Values	Description	Required
KAFKA_BROKERS	“localhost:9092”	N/A	Comma-separated list of at least one broker which is a member of the target cluster	yes
KAFKA_USE_TLS	“false”	[“true”, “false”]	Whether to use TLS or not	Default: “false”
KAFKA_USE_SASL	“false”	[“true”, “false”]	Whether to use SASL or not	Default: “false”
SASL_USERNAME	“sasl_user”	N/A	SASL username	yes if using SASL, otherwise no
SASL_PASSWORD	“sasl_pwd”	N/A	SASL password	yes if using SASL, otherwise no
KAFKA_SKIP_VERIFY	“false”	[“true”, “false”]	Controls whether a client verifies the server’s certificate chain and host name	Default: “false”
KAFKA_DISABLE_COMPRESSION	“false”	[“true”, “false”]	Whether to use message compression or not	Default: “false”
KAFKA_BATCH_SIZE	“40”	N/A	BatchSize sets the max amount of records the client will buffer, blocking new produces until records are finished if this limit is reached.	Default: “40”
KAFKA_BATCH_BYTES	“52428800”	N/A	BatchBytes parameter controls the amount of memory in bytes that will be used for each batch.	Default: “52428800”
KAFKA_BATCH_TIMEOUT	“10ms”	N/A	Linger controls the amount of time to wait for additional messages before sending the current batch.	Default: “10ms”
ENABLE_KERBEROS	“false”	[“true”, “false”]	Whether to enable Kerberos or not	Default: “false”
KRB_CONFIG_PATH	“/path/to/config/file”	N/A	Path to the Kerberos configuration file	yes, if kerberos is enabled
KRB_REALM	“REALM.com”	N/A	Domain over which a Kerberos authentication server has the authority to authenticate a user, host or service.	yes, if kerberos is enabled
KRB_SERVICE_NAME	“kerberos-service”	N/A	Service name we will get a ticket for.	yes, if kerberos is enabled
KRB_KEY_TAB	“/path/to/file.keytab”	N/A	Path to the keytab file	yes, if kerberos is enabled
KRB_USERNAME	“user”	N/A	Username of the service principal	yes, if kerberos is enabled
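
Several Kafka variables become required only when SASL or Kerberos is switched on. The Python sketch below lists which variables are missing for a given set of values. The helper `missing_kafka_vars` is hypothetical and encodes only the conditions stated in the Required column above.

```python
KERBEROS_VARS = ("KRB_CONFIG_PATH", "KRB_REALM", "KRB_SERVICE_NAME",
                 "KRB_KEY_TAB", "KRB_USERNAME")


def missing_kafka_vars(env):
    """Return the names of Kafka target-broker variables that are required but unset."""
    missing = []
    if not env.get("KAFKA_BROKERS"):
        missing.append("KAFKA_BROKERS")
    # SASL credentials are needed only when KAFKA_USE_SASL is "true".
    if env.get("KAFKA_USE_SASL", "false") == "true":
        missing += [n for n in ("SASL_USERNAME", "SASL_PASSWORD") if not env.get(n)]
    # The KRB_* variables are needed only when ENABLE_KERBEROS is "true".
    if env.get("ENABLE_KERBEROS", "false") == "true":
        missing += [n for n in KERBEROS_VARS if not env.get(n)]
    return missing
```

An empty list means the Kafka-specific requirements from the table are satisfied; it says nothing about whether the broker is reachable or the credentials are valid.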