DATAPHOS

Overview

Publisher is a component developed for running a constant flow of ready-to-digest data packets across your Cloud infrastructure – sourced directly from a database or from an exposed API. Whereas most Change Data Capture (CDC) solutions focus on capturing and transmitting changes on individual tables, Publisher focuses on forming well-defined, structured objects directly at the source, publishing them to a distributed message broker of your choice and allowing a multitude of consumers to independently consume and process the data in their own individual pipelines.

Collectively, this is known as a create-transform-publish pattern. The structured objects generated by the Publisher are what we refer to as business objects – you are no longer pushing technical data on changes in your database, but curated, key business information.

As a user, you write a minimal, easy-to-understand YAML configuration that defines the query and how its result should be assembled into a business object. Publisher takes care of the rest: assembling the data, serializing it, encrypting it and, finally, publishing it. You start with a query:

SELECT 
    t.transaction_id,
    t.transaction_time,
    t.amount,
    t.user_id,
    u.user_name,
    u.user_country
FROM
    transactions t
    JOIN user u ON t.user_id = u.user_id
WHERE 
    -- Publisher syntax for defining the time increments in which data should be captured.
    -- Publisher will automatically replace the rules here with actual timestamps,
    -- calculated based on configuration.
    transaction_time >= to_timestamp({{ .FetchFrom }}, 'YYYY-MM-DD HH24:MI:SS')
    AND 
    transaction_time <  to_timestamp({{ .FetchTo }}, 'YYYY-MM-DD HH24:MI:SS');

Publisher will automatically replace the placeholder variables with actual timestamps calculated based on configuration.
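The substitution step can be pictured with a short Python sketch. It is not Publisher's implementation: the `render_query` helper, the single-quoting of the rendered values and the example window are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical helper (not part of Publisher): fills the {{ .FetchFrom }} and
# {{ .FetchTo }} placeholders for one fetch window.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render_query(template, fetch_from, fetch_to):
    values = {
        "FetchFrom": fetch_from.strftime("%Y-%m-%d %H:%M:%S"),
        "FetchTo": fetch_to.strftime("%Y-%m-%d %H:%M:%S"),
    }

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unknown placeholder: {name}")
        # Assumption: the value is rendered as a single-quoted SQL string literal.
        return f"'{values[name]}'"

    return PLACEHOLDER.sub(substitute, template)


template = (
    "transaction_time >= to_timestamp({{ .FetchFrom }}, 'YYYY-MM-DD HH24:MI:SS') "
    "AND transaction_time < to_timestamp({{ .FetchTo }}, 'YYYY-MM-DD HH24:MI:SS')"
)
rendered = render_query(
    template,
    datetime(2024, 1, 1, 0, 0, 0),  # made-up window start
    datetime(2024, 1, 1, 1, 0, 0),  # made-up window end
)
print(rendered)
```

Each run of the query then covers one half-open window: rows at or after the start and strictly before the end, so consecutive windows do not overlap.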

Our goal is to publish information on all transactions, aggregated by individual users.

We would therefore format the results of this query as:

definition: # Define general structure.
    - user:
        - user_id
        - user_name
        - user_country
    - transactions:
        - transaction_id
        - transaction_time
        - amount
groupElements: # Fields that identify a unique group.
    - user_id
arrayElements: # Elements that repeat within a group (emitted as arrays).
    - transactions
idElements: # Fields used to build the unique identifier the message is sent with.
    - user_id

Serialized into JSON, this would result in the following data being published to your streaming platform:

[
    {
        "user": {
            "user_id": "some_user",
            "user_name": "Some Name",
            "user_country": "Croatia"
        },
        "transactions": [
            {
                "transaction_id": "...",
                "transaction_time": "...",
                "amount": 1.5
            },
            {
                "transaction_id": "...",
                "transaction_time": "...",
                "amount": 5.3
            }, 
            // ...
        ]
    },
    {
        "user": {
            "user_id": "some_other_user",
            "user_name": "Some Other Name",
            "user_country": "France"
        },
        "transactions": [
            {
                "transaction_id": "...",
                "transaction_time": "...",
                "amount": 5
            },
            // ...
        ]
    }
]

(You can choose whether each message contains multiple user records or a single record.)
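To make the mapping from flat query rows to business objects concrete, here is a minimal Python sketch of the grouping logic. It is an illustration only, not Publisher's code: the `assemble` function is hypothetical, the configuration is passed as a plain dictionary rather than parsed from YAML, and the sample rows are made up.

```python
def assemble(rows, definition, group_elements, array_elements):
    """Group flat rows into nested objects, one object per unique group key."""
    groups = {}  # insertion-ordered: group key -> business object
    for row in rows:
        key = tuple(row[field] for field in group_elements)
        obj = groups.get(key)
        if obj is None:
            # First row of a group: non-array elements are taken from this row.
            obj = {
                name: [] if name in array_elements
                else {field: row[field] for field in fields}
                for name, fields in definition.items()
            }
            groups[key] = obj
        # Every row contributes one entry to each repeatable element.
        for name in array_elements:
            obj[name].append({field: row[field] for field in definition[name]})
    return list(groups.values())


# Mirrors the configuration above (a dict is used here for brevity).
definition = {
    "user": ["user_id", "user_name", "user_country"],
    "transactions": ["transaction_id", "transaction_time", "amount"],
}

# Made-up rows, shaped like the result of the query above.
rows = [
    {"transaction_id": "t1", "transaction_time": "2024-01-01 00:05:00", "amount": 1.5,
     "user_id": "some_user", "user_name": "Some Name", "user_country": "Croatia"},
    {"transaction_id": "t2", "transaction_time": "2024-01-01 00:07:00", "amount": 5.3,
     "user_id": "some_user", "user_name": "Some Name", "user_country": "Croatia"},
    {"transaction_id": "t3", "transaction_time": "2024-01-01 00:09:00", "amount": 5,
     "user_id": "some_other_user", "user_name": "Some Other Name", "user_country": "France"},
]

objects = assemble(rows, definition, ["user_id"], ["transactions"])
```

With these rows, `objects` holds two business objects: one for `some_user` with two transactions and one for `some_other_user` with a single transaction.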

Publisher Components

Publisher comes as a set of microservices. A single Publisher deployment is designed to enable multiple independent publishing jobs, configured and managed via a simple user interface.

Manager

The Manager component is a REST server that exposes API endpoints used for configuration management. All configuration information (the source you will be pulling data from, the destination you will be publishing data to and how the data should be fetched and formatted) is stored in a small metadata database which is managed by the Publisher.

Scheduler

Once you define the configuration of a Publishing job, the Scheduler component initializes a Worker responsible for running the job itself. A Kubernetes Pod is created for each active Publisher job configuration. The Scheduler destroys the pods once the job is completed or an error state is reached. This is what makes Publisher a true Kubernetes-native solution: dynamic and scalable.

The Publisher can either publish data constantly or publish increments of data on a given cron-based schedule (for more information on this, please view the Usage page).

While periodically checking the status of the Kubernetes pods, the Scheduler component ensures that all pods that should be active are, in fact, active. If a pod crashes while its job is still supposed to be running, the Scheduler creates it again.
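The periodic check can be thought of as a small reconciliation step: compare the jobs that should be running with the pods that actually exist. The Python sketch below is an assumption-laden illustration, not the Scheduler's code. The `reconcile` function and the job names are hypothetical, and no Kubernetes client is called. Only the phase names (`Pending`, `Running`, `Failed`) are standard Kubernetes pod phases.

```python
# Phases in which a pod is considered alive for the purposes of this sketch.
HEALTHY_PHASES = {"Pending", "Running"}


def reconcile(desired_jobs, observed_pods):
    """Return the names of jobs whose pod needs to be (re)created.

    desired_jobs:  set of job names that are supposed to be running.
    observed_pods: dict mapping job name -> observed pod phase.
                   Jobs without a pod are simply absent from the dict.
    """
    return sorted(
        job for job in desired_jobs
        if observed_pods.get(job) not in HEALTHY_PHASES
    )


# Hypothetical state: job-b has crashed and job-c has no pod at all.
to_create = reconcile(
    desired_jobs={"job-a", "job-b", "job-c"},
    observed_pods={"job-a": "Running", "job-b": "Failed"},
)
```

Here `to_create` contains `job-b` and `job-c`, while the healthy `job-a` is left alone. A job that is no longer in the desired set is never recreated, which matches the rule that pods are only restored if they are supposed to be running.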

Worker

One Worker component is created for each active Publisher configuration. There can be multiple active workers with different configurations simultaneously.

Once a Worker is created, it processes data in a loop until it reaches a previously defined stopping point, is stopped by the user, or encounters an error.
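The three exit conditions can be sketched as a simple loop over fetch windows. This is a hypothetical Python illustration, not the Worker's implementation: `run_worker`, its parameters and the fixed window size are assumptions chosen to keep the example small.

```python
from datetime import datetime, timedelta


def run_worker(start, stop_at, step, process, should_stop=lambda: False):
    """Process consecutive [fetch_from, fetch_to) windows until a stop condition.

    Returns "completed" when the stopping point is reached, "stopped" when the
    user-controlled should_stop() callback returns True, and "error" when
    processing a window raises an exception.
    """
    fetch_from = start
    while fetch_from < stop_at:
        if should_stop():
            return "stopped"
        fetch_to = min(fetch_from + step, stop_at)
        try:
            process(fetch_from, fetch_to)
        except Exception:
            return "error"
        fetch_from = fetch_to
    return "completed"


windows = []
result = run_worker(
    start=datetime(2024, 1, 1, 0, 0),
    stop_at=datetime(2024, 1, 1, 0, 3),  # made-up stopping point
    step=timedelta(minutes=1),
    process=lambda fetch_from, fetch_to: windows.append((fetch_from, fetch_to)),
)
```

In this run the worker handles three one-minute windows and then finishes with `"completed"`. In a real deployment the `process` step would run the query for the window, assemble the business objects and publish them.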

Web UI

The Web UI component is a visual tool used to apply the configuration files of the individual Publisher jobs and monitor the performance of Publisher instances.