<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>The Ops Community ⚙️: Avital Trifsik</title>
    <description>The latest articles on The Ops Community ⚙️ by Avital Trifsik (@avital_trifsik).</description>
    <link>https://community.ops.io/avital_trifsik</link>
    <image>
      <url>https://community.ops.io/images/oeUJln31ycwbHqaEpjQnfU6UwJ-iHtETGxr7476f3u4/rs:fill:90:90/g:sm/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL3Vz/ZXIvcHJvZmlsZV9p/bWFnZS8xMDYzL2Y2/ZWY0NTQyLWQ4NTEt/NGE1Yi1hODlkLWRi/OTk4NTllZDRlMy5q/cGc</url>
      <title>The Ops Community ⚙️: Avital Trifsik</title>
      <link>https://community.ops.io/avital_trifsik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://community.ops.io/feed/avital_trifsik"/>
    <language>en</language>
    <item>
      <title>Task scheduling with a message broker</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Thu, 17 Aug 2023 08:40:37 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/task-scheduling-with-a-message-broker-ma6</link>
      <guid>https://community.ops.io/memphis_dev/task-scheduling-with-a-message-broker-ma6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Task scheduling is essential in modern applications for maximizing resource utilization and user experience (non-blocking task fulfillment).&lt;br&gt;
A queue is a powerful tool that allows your application to manage and prioritize tasks in a structured, persistent, and scalable way.&lt;br&gt;
While there are multiple possible solutions, working with a queue (also the natural data structure for this type of work) ensures that tasks are completed in their creation order, without the risk of forgetting, overlooking, or double-processing critical tasks.&lt;/p&gt;
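&lt;p&gt;As a toy illustration of that ordering guarantee (a plain in-memory sketch, not a broker API):&lt;/p&gt;

```javascript
// A FIFO queue hands tasks back in creation order, and dequeuing removes a
// task so it cannot be picked up twice.
const queue = [];
queue.push({ id: 1, task: "resize image" });  // created first
queue.push({ id: 2, task: "send receipt" });  // created second

const first = queue.shift();  // oldest task comes out first
const second = queue.shift();
console.log(first.id, second.id, queue.length); // 1 2 0
```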

&lt;p&gt;An interesting story about this need, and how it evolves as scale grows, can be found in a blog post by one of DigitalOcean’s co-founders:&lt;br&gt;
&lt;a href="https://www.digitalocean.com/blog/from-15-000-database-connections-to-under-100-digitaloceans-tale-of-tech-debt"&gt;From 15,000 database connections to under 100.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Any other solutions besides a queue?
&lt;/h2&gt;

&lt;p&gt;Multiple. Each with its own advantages and disadvantages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron&lt;/strong&gt;&lt;br&gt;
You can use cron job schedulers to automate such tasks. The issue with cron is that each job and its execution time must be written explicitly, ahead of the actual execution, making your architecture highly static and not event-driven. Cron is mainly suitable for a well-defined, known set of tasks that have to take place regardless, rather than tasks triggered by user actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;br&gt;
A database can be a good, simple place to store tasks, and it is often used that way in the early days of a product MVP,&lt;br&gt;
but there are multiple issues with that approach, for example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Insertion order is not guaranteed, so tasks might not be handled in the order they were actually created.&lt;/li&gt;
&lt;li&gt;Double processing can happen: a database does not delete a record once it is read, so a specific task can be read and processed twice, with potentially catastrophic results for the system’s behavior.&lt;/li&gt;
&lt;/ol&gt;
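&lt;p&gt;The double-processing risk can be sketched in a few lines of JavaScript. The in-memory “table” and the two “workers” below are hypothetical, but they mimic two pollers issuing the same unguarded SELECT:&lt;/p&gt;

```javascript
// Hypothetical sketch of the double-read hazard: two workers poll the same
// table and both see the task before either marks it done.
const table = [{ id: 1, task: "send welcome email", done: false }];

function pollUnfinished() {
  // stands in for "SELECT * FROM tasks WHERE done = false" with no locking
  return table.filter(function (row) { return !row.done; });
}

const seenByWorkerA = pollUnfinished(); // worker A reads task 1
const seenByWorkerB = pollUnfinished(); // worker B reads the same task
const doubleProcessed = seenByWorkerA[0].id === seenByWorkerB[0].id;
console.log(doubleProcessed); // true: both workers will handle the task
```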




&lt;h2&gt;
  
  
  Traditional queues
&lt;/h2&gt;

&lt;p&gt;Often, for task scheduling, the queue of choice would be a pub/sub system like RabbitMQ.&lt;/p&gt;

&lt;p&gt;Choosing RabbitMQ over a classic streaming broker such as Kafka makes sense in the context of task scheduling: Kafka naturally retains records (or tasks) until a specific point in time, whether or not they have been acknowledged.&lt;/p&gt;

&lt;p&gt;The downside of choosing RabbitMQ is its limits in scale, robustness, and performance, all of which become increasingly necessary over time.&lt;/p&gt;

&lt;p&gt;With that idea in mind, Memphis is a broker that offers scale, robustness, and high throughput, alongside a retention type that fully enables task scheduling over a message broker.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memphis Broker is a perfect queue for task scheduling
&lt;/h2&gt;

&lt;p&gt;In v1.2, Memphis released support for ACK-based retention through Memphis Cloud. Read more &lt;a href="https://docs.memphis.dev/memphis/memphis-broker/concepts/station#retention"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Messages will be removed from a station only when &lt;strong&gt;acknowledged by all&lt;/strong&gt; the connected consumer groups. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If we have only one connected consumer group when a message/record is acknowledged, it will be automatically removed from the station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If we have two connected consumer groups, the message will be removed from the station (=queue) once all CGs acknowledge the message.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
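&lt;p&gt;A minimal in-memory sketch of that rule (an illustration only, not the Memphis implementation): a message stays in the station until every connected consumer group has acknowledged it.&lt;/p&gt;

```javascript
// Ack-based retention in miniature: track, per message, which consumer
// groups still have to acknowledge; remove the message once none remain.
function makeStation(groupNames) {
  const messages = new Map(); // messageId mapped to the set of pending groups

  return {
    produce: function (id) {
      messages.set(id, new Set(groupNames));
    },
    ack: function (id, group) {
      const pending = messages.get(id);
      if (!pending) return;
      pending.delete(group);
      if (pending.size === 0) messages.delete(id); // all CGs acked: remove
    },
    size: function () {
      return messages.size;
    },
  };
}

const station = makeStation(["cg_workers", "cg_auditors"]);
station.produce("task-1");
station.ack("task-1", "cg_workers");        // one CG still pending
const retainedAfterOneAck = station.size(); // 1
station.ack("task-1", "cg_auditors");       // now every CG has acked
const retainedAfterAllAcks = station.size(); // 0
```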

&lt;p&gt;We mentioned earlier the advantages and disadvantages of using traditional queues such as RabbitMQ in comparison to common brokers such as Kafka in the context of task scheduling. When comparing both tools to Memphis, it’s all about getting the best from both worlds.&lt;/p&gt;

&lt;p&gt;A few of Memphis.dev’s advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ordering&lt;/li&gt;
&lt;li&gt;Exactly-once delivery guarantee&lt;/li&gt;
&lt;li&gt;Highly scalable, serving data at high throughput with low latency&lt;/li&gt;
&lt;li&gt;Ack-based retention&lt;/li&gt;
&lt;li&gt;Many-to-Many pattern&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Getting started with Memphis Broker as a task queue
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://cloud.memphis.dev/"&gt;Sign up&lt;/a&gt; to Memphis Cloud.&lt;/li&gt;
&lt;li&gt;Connect your task producer.
Producers are the entities that insert new records or tasks; consumers are the entities that read and process them.
A single client with a single connection object can act as both at the same time, meaning be both a producer and a consumer, though not to the same station, because that would lead to an infinite loop. It’s doable, but it doesn’t make much sense.
That pattern mainly reduces footprint and the number of needed “workers”: a single worker can produce tasks to one station while also acting as a consumer, or processor, for another station serving a different use case.
The code example below creates an Ack-based station and initiates a producer in Node.js:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require("memphis-dev");

(async function () {
  let memphisConnection;

  try {
    memphisConnection = await memphis.connect({
      host: "MEMPHIS_BROKER_HOSTNAME",
      username: "CLIENT_TYPE_USERNAME",
      password: "PASSWORD",
      accountId: ACCOUNT_ID
    });

    const station = await memphis.station({
      name: 'tasks',
      retentionType: memphis.retentionTypes.ACK_BASED,
    })

    const producer = await memphisConnection.producer({
      stationName: "tasks",
      producerName: "producer-1",
    });

    const headers = memphis.headers();
    headers.add("Some_KEY", "Some_VALUE");
    await producer.produce({
      message: { taskID: 123, task: "deploy a new instance" }, // the payload can be a plain JS object
      headers: headers,
    });

    memphisConnection.close();
  } catch (ex) {
    console.log(ex);
    if (memphisConnection) memphisConnection.close();
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Connect your task consumer.
The consumer group below will consume tasks, process them, and, once finished, acknowledge them.
Once a task is acknowledged, the broker removes the record, ensuring exactly-once processing. We use the station entity here as well, in case the consumer starts before the producer; no need to worry, the declaration only takes effect if the station does not exist yet.
Another thing to remember is that a consumer group can contain multiple consumers to increase parallelism and read throughput. Within each consumer group, only a single consumer will read and ack a specific message, not all of the contained consumers. If that fan-out pattern is needed, use multiple consumer groups.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { memphis } = require("memphis-dev");

(async function () {
  let memphisConnection;

  try {
    memphisConnection = await memphis.connect({
      host: "MEMPHIS_BROKER_HOSTNAME",
      username: "APPLICATION_TYPE_USERNAME",
      password: "PASSWORD",
      accountId: ACCOUNT_ID
    });

    const station = await memphis.station({
      name: 'tasks',
      retentionType: memphis.retentionTypes.ACK_BASED,
    })

    const consumer = await memphisConnection.consumer({
      stationName: "tasks",
      consumerName: "worker1",
      consumerGroup: "cg_workers",
    });

    consumer.setContext({ key: "value" });
    consumer.on("message", (message, context) =&amp;gt; {
      console.log(message.getData().toString());
      message.ack();
      const headers = message.getHeaders();
    });

    consumer.on("error", (error) =&amp;gt; {});
  } catch (ex) {
    console.log(ex);
    if (memphisConnection) memphisConnection.close();
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
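&lt;p&gt;The consumer-group semantics described above can be sketched as follows (an illustration only, not SDK code): within a group, each message is delivered to exactly one member, while each additional group receives its own copy.&lt;/p&gt;

```javascript
// Delivery sketch: every group gets a copy of the message, but inside each
// group only one member (here chosen round-robin) receives it.
function deliver(message, groups) {
  const deliveries = [];
  groups.forEach(function (group) {
    const member = group.members[message.seq % group.members.length];
    deliveries.push(group.name + ":" + member);
  });
  return deliveries;
}

const groups = [
  { name: "cg_workers", members: ["worker1", "worker2"] },
  { name: "cg_auditors", members: ["auditor1"] },
];

console.log(deliver({ seq: 0 }, groups)); // [ 'cg_workers:worker1', 'cg_auditors:auditor1' ]
console.log(deliver({ seq: 1 }, groups)); // [ 'cg_workers:worker2', 'cg_auditors:auditor1' ]
```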






&lt;p&gt;If you liked this tutorial and want to learn what else you can do with Memphis, head &lt;a href="https://docs.memphis.dev/memphis/getting-started/tutorials"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis-broker"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/blog/streaming-first-infrastructure-for-real-time-machine-learning/"&gt;Memphis.dev&lt;/a&gt; by &lt;a href="https://www.linkedin.com/in/shay-bratslavsky/"&gt;Shay Bratslavsky&lt;/a&gt;, software engineer at Memphis.dev&lt;/p&gt;

</description>
      <category>taskmanagement</category>
      <category>taskscheduling</category>
      <category>messagebroker</category>
    </item>
    <item>
      <title>Part 4: Validating CDC Messages with Schemaverse</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Thu, 22 Jun 2023 11:51:12 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/part-4-validating-cdc-messages-with-schemaverse-3mcd</link>
      <guid>https://community.ops.io/memphis_dev/part-4-validating-cdc-messages-with-schemaverse-3mcd</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part four of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the previous two blog posts (&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2&lt;/a&gt; and &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;), we described how to implement a change data capture (CDC) pipeline for &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; using &lt;a href="https://debezium.io/documentation/reference/stable/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; and &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Schema on Write, Schema on Read
&lt;/h2&gt;

&lt;p&gt;With relational databases, schemas are defined before any data are ingested.  Only data that conforms to the schema can be inserted into the database.  This is known as “schema on write.”  This pattern ensures data integrity but can limit flexibility and the ability to evolve a system.  &lt;/p&gt;

&lt;p&gt;Predefined schemas are optional in NoSQL databases like MongoDB.  MongoDB models collections of objects.  In the most extreme case, collections can contain completely different types of objects such as cats, tanks, and books.  More commonly, fields may only be present on a subset of objects or the value types may vary from one object to another.  This flexibility makes it easier to evolve schemas over time and efficiently support objects with many optional fields.&lt;/p&gt;

&lt;p&gt;Schema flexibility puts more onus on applications that read the data.  Clients need to check for any desired field and confirm their data types.  This pattern is called "schema on read."&lt;/p&gt;
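&lt;p&gt;A small sketch of what “schema on read” looks like in client code (the document shape follows the todo example used later in this series; the helper name is made up):&lt;/p&gt;

```javascript
// Hypothetical "schema on read" helper: the consumer defensively checks each
// field before trusting a document that may or may not carry it.
function readDueDate(doc) {
  if (!doc) return null;
  const wrapper = doc.due_date;
  if (!wrapper) return null;
  if (typeof wrapper.$date !== "number") return null;
  return new Date(wrapper.$date); // field present and well-formed
}

console.log(readDueDate({ due_date: { $date: 1684266602978 } })); // a Date object
console.log(readDueDate({ description: "buy milk" }));            // null
```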




&lt;h2&gt;
  
  
  Malformed Records Cause Crashes
&lt;/h2&gt;

&lt;p&gt;In one of my positions earlier in my career, I worked on a team that developed and maintained data pipelines for an online ad recommendation system.  One of the most common sources of downtime was malformed records.  Pipeline code can fail if a field is missing, an unexpected value is encountered, or badly-formatted data fails to parse.  If the pipeline isn't developed with errors in mind (e.g., using &lt;a href="https://en.wikipedia.org/wiki/Defensive_programming"&gt;defensive programming techniques&lt;/a&gt;, explicitly-defined data models, and data validation), the entire pipeline may crash and require manual intervention by an operator.&lt;/p&gt;

&lt;p&gt;Unfortunately, malformed data, especially when handling large volumes of data, is a frequent occurrence.  Simply hoping for the best won't lead to resilient pipelines.  As the saying goes, "Hope for the best. Plan for the worst."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Best of Both Worlds: Data Validation with Schemaverse
&lt;/h2&gt;

&lt;p&gt;Fortunately, Memphis.dev has an awesome feature called Schemaverse.  Schemaverse provides a mechanism to check messages for compliance with a specified schema and to handle non-conforming messages.&lt;/p&gt;

&lt;p&gt;To use Schemaverse, the operator first needs to define a schema.  Message schemas can be defined using JSON Schema, Google Protocol Buffers, or GraphQL.  The operator will choose the schema definition language appropriate to the format of the message payloads.&lt;/p&gt;

&lt;p&gt;Once a schema is defined, the operator can "attach" the schema to a station.  The schema will be downloaded by clients using the Memphis.dev client SDKs, and the SDK will validate each message before sending it to the Memphis broker.  If a message fails validation, the client will redirect it to the dead-letter queue, trigger a notification, and raise an exception to notify the client's user.&lt;/p&gt;
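&lt;p&gt;Conceptually, the client-side flow looks something like the sketch below (a rough model, not the SDK's actual code): validate before sending, and divert non-conforming messages to the dead-letter path while surfacing an error to the caller (simplified here to a return value):&lt;/p&gt;

```javascript
// Rough model of client-side schema enforcement: validate first, then either
// deliver to the station or re-route to the dead-letter path.
function sendWithValidation(message, validate, broker) {
  if (validate(message)) {
    broker.station.push(message);
    return "delivered";
  }
  broker.deadLetter.push(message); // non-conforming: keep it, don't drop it
  return "dead-lettered";
}

const broker = { station: [], deadLetter: [] };
const isValid = function (m) { return typeof m.description === "string"; };

const ok = sendWithValidation({ description: "buy milk" }, isValid, broker);
const bad = sendWithValidation({ description: 42 }, isValid, broker);
console.log(ok, bad, broker.station.length, broker.deadLetter.length); // delivered dead-lettered 1 1
```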

&lt;p&gt;In this example, we'll look at using Schemaverse to validate change data capture (CDC) events from MongoDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Review of the Solution
&lt;/h2&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;previous post&lt;/a&gt;, we described a change data capture (CDC) pipeline for a collection of todo items stored in MongoDB.  Our solution consists of eight components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds.  Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt; to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer Service&lt;/strong&gt;: A &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service that consumes messages from the todo-cdc-events station, deserializes the MongoDB records, and pushes them to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaned Printing Consumer&lt;/strong&gt;: A second instance of the printing consumer that prints messages pushed to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/aadYGOp4lKr9nEWd5i-p_Hx5uoUhZ51feu_GhrfolPU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvNXhy/OGoxcmtsZ2MyMGJp/MTRtYjAuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/aadYGOp4lKr9nEWd5i-p_Hx5uoUhZ51feu_GhrfolPU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvNXhy/OGoxcmtsZ2MyMGJp/MTRtYjAuanBn" alt="dataflow diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;br&gt;
In this iteration, we aren't adding or removing any of the components.  Rather, we're just going to change Memphis.dev's configuration to perform schema validation on messages sent to the "cleaned-todo-cdc-events" station.&lt;/p&gt;


&lt;h2&gt;
  
  
  Schema for Todo Change Data Capture (CDC) Events
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;, we transformed the messages to hydrate a serialized JSON subdocument to produce fully deserialized JSON messages.  The resulting message looked like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"schema" : ...,

"payload" : {
"before" : null,

"after" : {
"_id": { "$oid": "645fe9eaf4790c34c8fcc2ed" },
"creation_timestamp": { "$date": 1684007402978 },
"due_date": { "$date" : 1684266602978 },
"description": "buy milk",
"completed": false
},

...
}
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each JSON-encoded message has two top-level fields, "schema" and "payload."  We are concerned with the "payload" field, which has two required fields, "before" and "after."  The before field contains a copy of the record before it was modified (or null if it didn't exist), while the after field contains a copy of the record after it was modified (or null if the record is being deleted).&lt;/p&gt;

&lt;p&gt;From this example, we can define criteria that messages must satisfy to be considered valid.  Let's write the criteria out as a set of rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The payload/before field may contain a todo object or null.&lt;/li&gt;
&lt;li&gt;The payload/after field may contain a todo object or null.&lt;/li&gt;
&lt;li&gt;A todo object must have five fields ("_id", "creation_timestamp", "due_date", "description", and "completed").&lt;/li&gt;
&lt;li&gt;The creation_timestamp must be an object with a single field ("$date").  The "$date" field must have a positive integer value (Unix timestamp).&lt;/li&gt;
&lt;li&gt;The due_date must be an object with a single field ("$date").  The "$date" field must have a positive integer value (Unix timestamp).&lt;/li&gt;
&lt;li&gt;The description field should have a string value.  Nulls are not allowed.&lt;/li&gt;
&lt;li&gt;The completed field should have a boolean value.  Nulls are not allowed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this project, we'll define the schema using &lt;a href="https://json-schema.org/"&gt;JSON Schema&lt;/a&gt;. JSON Schema is a very powerful data modeling language.  It supports defining required fields, field types (e.g., integers, strings, etc.), whether fields are nullable, field formats (e.g., dates / times, email addresses), and field constraints (e.g., minimum or maximum values).  Objects can be defined and referenced by name, allowing recursive schemas and reuse of definitions.  Schemas can be further combined using and, or, any, and not operators.  As one might expect, this expressiveness comes at a cost: the JSON Schema definition language is complex, and covering it fully is beyond the scope of this tutorial.&lt;/p&gt;




&lt;h2&gt;
  
  
  Creating a Schema and Attaching it to a Station
&lt;/h2&gt;

&lt;p&gt;Let's walk through the process of creating a schema and attaching it to a station.  You'll first need to complete the first 10 steps from &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2&lt;/a&gt; and &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;part 3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 11: Navigate to the Schemaverse Tab&lt;/strong&gt;&lt;br&gt;
Navigate to the Memphis UI in your browser.  For example, you might be able to find it at &lt;a href="https://localhost:9000/"&gt;https://localhost:9000/&lt;/a&gt; .  Once you are signed in, navigate to the Schemaverse tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/GQaWpA1483JxO70jj_jEaovu4kYjVSJ1HPajkRA-og0/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM21y/cnYzODl5YXZ2dTZz/YWxwYnYucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/GQaWpA1483JxO70jj_jEaovu4kYjVSJ1HPajkRA-og0/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM21y/cnYzODl5YXZ2dTZz/YWxwYnYucG5n" alt="Image description" width="512" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 12: Create the Schema&lt;/strong&gt;&lt;br&gt;
Click the "Create from blank" button to create a new schema.  Set the schema name to "todo-cdc-schema" and the schema type to "JSON schema."  Paste the following JSON Schema document into the textbox on the right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/product.schema.json",
    "type" : "object",
    "properties" : {
        "payload" : {
            "type" : "object",
            "properties" : {
                "before" : {
                    "oneOf" : [{ "type" : "null" }, { "$ref" : "#/$defs/todoItem" }]
                },
                "after" : {
                    "oneOf" : [{ "type" : "null" }, { "$ref" : "#/$defs/todoItem" }]
                }
            },
            "required" : ["before", "after"]
        }
    },
    "required" : ["payload"],
   "$defs" : {
      "todoItem" : {
          "title": "TodoItem",
          "description": "An item in a todo checklist",
          "type" : "object",
          "properties" : {
              "_id" : {
                  "type" : "object",
                  "properties" : {
                      "$oid" : {
                          "type" : "string"
                      }
                  }
              },
              "description" : {
                  "type" : "string"
              },
              "creation_timestamp" : {
                  "type" : "object",
                  "properties" : {
                      "$date" : {
                          "type" : "integer"
                      }
                  }
              },
              "due_date" : {
                    "anyOf" : [
                        {
                            "type" : "object",
                            "properties" : {
                                "$date" : {
                                    "type" : "integer"
                                }
                            }
                        },
                        {
                            "type" : "null"
                        }
                    ]
              },
              "completed" : {
                  "type" : "boolean"
              }
          },
          "required" : ["_id", "description", "creation_timestamp", "completed"]
      }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When done, your window should look like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/VSC2Gq451mUB1CX7rtv5gvkukWzrYrF-hspyOfoAmzI/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMGI4/MWJ3eXpjbWU1d3Zp/djJya2IucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/VSC2Gq451mUB1CX7rtv5gvkukWzrYrF-hspyOfoAmzI/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMGI4/MWJ3eXpjbWU1d3Zp/djJya2IucG5n" alt="schemaverse" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When done, click the "Create schema" button. Once the schema has been created, you'll be returned to the Schemaverse tab.  You should see an entry for the newly created schema like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/3PHbmWCAbaIV7KpWA7YK-U9S3UpPWX5vv7wZZJnADus/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOGNk/ZW0wOWc0NjdwOTc2/anIxOXIucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/3PHbmWCAbaIV7KpWA7YK-U9S3UpPWX5vv7wZZJnADus/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOGNk/ZW0wOWc0NjdwOTc2/anIxOXIucG5n" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 13: Attach the Schema to the Station&lt;/strong&gt;&lt;br&gt;
Once the schema is created, we want to attach the schema to the "cleaned-todo-cdc-events" station. Double-click on the "todo-cdc-schema" window to bring up its details window like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/KO0X4WBr575pTdHmVF_pWiWmdPIXTM3HNMQg3z2liUA/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdmhp/aHh6czRoYm9sZzg3/NHN6dm4ucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/KO0X4WBr575pTdHmVF_pWiWmdPIXTM3HNMQg3z2liUA/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdmhp/aHh6czRoYm9sZzg3/NHN6dm4ucG5n" alt="todo cdc schema" width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click on the "+ Attach to Station" button.  This will bring up the following window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/d_qPYkvZ89XiSx41b0tE4V9KR--WCKZ2668oBGNxwtI/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZml0/bDVrcXdncnFsMWp2/M3RlaGwucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/d_qPYkvZ89XiSx41b0tE4V9KR--WCKZ2668oBGNxwtI/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZml0/bDVrcXdncnFsMWp2/M3RlaGwucG5n" alt="enforce schema" width="800" height="1268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the "cleaned-todo-cdc-events" station, and click "Attach Selected."  The producers attached to the station will automatically download the schema and begin validating outgoing messages within a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 14: Confirm that Messages are Being Filtered&lt;/strong&gt;&lt;br&gt;
Navigate to the station overview page for the "cleaned-todo-cdc-events" station.  After a couple of minutes, you should see a red warning notification icon next to the "Dead-letter" tab name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/0qN4sRoSjuH4kt9NLSzNk7S8GB89mEWrimgKItSWBos/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM3l0/dHptdDgweDYzZmxj/eGR1MHkucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/0qN4sRoSjuH4kt9NLSzNk7S8GB89mEWrimgKItSWBos/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM3l0/dHptdDgweDYzZmxj/eGR1MHkucG5n" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you click on the "Dead-letter" tab and then the "Schema violation" subtab, you'll see the messages that failed the schema validation.  These messages have been re-routed to the dead letter queue so that they don't cause bugs in the downstream pipelines.  The window will look like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/fTY-05AibxLQA_pWfu9wHicvQOFCE38GevlRs6Si-FU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvY3M1/N2kxeDMzN283dTRo/NTNvd3YucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/fTY-05AibxLQA_pWfu9wHicvQOFCE38GevlRs6Si-FU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvY3M1/N2kxeDMzN283dTRo/NTNvd3YucG5n" alt="Image description" width="800" height="836"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations!  You're now using Schemaverse to validate messages.  This is one small but incredibly impactful step towards making your pipeline more reliable.&lt;/p&gt;




&lt;p&gt;In case you missed parts 1, 2, and 3:&lt;br&gt;
&lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Part 3: Transforming MongoDB CDC Event Messages&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, developer advocate at &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>dataprocessing</category>
      <category>schemaverse</category>
    </item>
    <item>
      <title>Part 3: Transforming MongoDB CDC Event Messages</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Tue, 06 Jun 2023 10:27:59 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/part-3-transforming-mongodb-cdc-event-messages-5agc</link>
      <guid>https://community.ops.io/memphis_dev/part-3-transforming-mongodb-cdc-event-messages-5agc</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part three of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;last blog post&lt;/a&gt;, we introduced a reference implementation for capturing change data capture (CDC) events from a &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; database using &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; and &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt;.  At the end of the post we noted that MongoDB records are serialized as strings in Debezium CDC messages like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "schema" : ...,

"payload" : {
"before" : null,

"after" : "{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ed\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402978},\\"due_date\\": {\\"$date\\": 1684266602978},\\"description\\": \\"buy milk\\",\\"completed\\": false}",

...
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to use the &lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management"&gt;Schemaverse&lt;/a&gt; functionality of Memphis.dev to check messages against an expected schema.  Messages that don’t match the schema are routed to a dead letter station so that they don’t impact downstream consumers.  If this all sounds like ancient Greek, don’t worry!  We’ll explain the details in our next blog post.&lt;/p&gt;

&lt;p&gt;To use functionality like Schemaverse, we need to deserialize the MongoDB records as JSON documents.  In this blog post, we describe a modification to our MongoDB CDC pipeline that adds a transformer service to deserialize the MongoDB records to JSON documents.&lt;/p&gt;




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;previous solution&lt;/a&gt; consisted of six components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds.  Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/hR4m-wS3ynS_R11TxMLjnKWUj2qv9sxLNRMlzM04r8U/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZ3pz/ejd5NXRmNTR0cXBj/a29xaXEuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/hR4m-wS3ynS_R11TxMLjnKWUj2qv9sxLNRMlzM04r8U/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZ3pz/ejd5NXRmNTR0cXBj/a29xaXEuanBn" alt="mongocdcd example" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this iteration, we are adding two additional components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transformer Service&lt;/strong&gt;: A &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service that consumes messages from the todo-cdc-events station, deserializes the MongoDB records, and pushes them to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaned Printing Consumer&lt;/strong&gt;: A second instance of the printing consumer that prints messages pushed to the cleaned-todo-cdc-events station.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our updated architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/cmNTYLhYGRM8BX_DYELPW7epheSW1BRo87G_wuY6V7w/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva3Zy/eXRoZG1ibGdpdGk3/b2I0MG0uanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/cmNTYLhYGRM8BX_DYELPW7epheSW1BRo87G_wuY6V7w/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva3Zy/eXRoZG1ibGdpdGk3/b2I0MG0uanBn" alt="data flow diagram" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A Deep Dive Into the Transformer Service
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skeleton of the Message Transformer Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/cdc-transformer/cdc_transformer.py"&gt;transformer&lt;/a&gt; service uses the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt;.  Let’s walk through the transformer implementation.  The main() method of our transformer first connects to the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt;.  The host, username, password, input station name, and output station name are read from environment variables, following the recommendations of the &lt;a href="https://12factor.net/config"&gt;Twelve-Factor App manifesto&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def main():
    try:
        print("Waiting on messages...")
        memphis = Memphis()
        await memphis.connect(host=os.environ[HOST_KEY],
                              username=os.environ[USERNAME_KEY],
                              password=os.environ[PASSWORD_KEY])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
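&lt;p&gt;The snippet above reads its settings through constants such as &lt;code&gt;HOST_KEY&lt;/code&gt; rather than hard-coded strings. As a rough sketch, those constants might be defined like this (the exact environment variable names here are assumptions for illustration, not taken from the transformer source):&lt;/p&gt;

```python
import os

# Hypothetical names for the environment variables holding connection
# details; the real key strings live in the transformer source and the
# docker-compose configuration.
HOST_KEY = "MEMPHIS_HOST"
USERNAME_KEY = "MEMPHIS_USERNAME"
PASSWORD_KEY = "MEMPHIS_PASSWORD"
INPUT_STATION_KEY = "MEMPHIS_INPUT_STATION"
OUTPUT_STATION_KEY = "MEMPHIS_OUTPUT_STATION"

def require_env(key):
    # Fail fast with a clear error if a required variable is missing.
    value = os.environ.get(key)
    if value is None:
        raise RuntimeError("missing required environment variable: " + key)
    return value
```

&lt;p&gt;Reading configuration from the environment keeps the same container image reusable across services; only the compose file changes.&lt;/p&gt;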



&lt;p&gt;Once a connection is established, we create consumer and producer objects.  In Memphis.dev, consumers and producers have names.  These names appear in the Memphis.dev UI, offering transparency into system operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Creating consumer")
        consumer = await memphis.consumer(station_name=os.environ[INPUT_STATION_KEY],
                                          consumer_name="transformer",
                                          consumer_group="")

        print("Creating producer")
        producer = await memphis.producer(station_name=os.environ[OUTPUT_STATION_KEY],
                                          producer_name="transformer")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consumer API uses the &lt;a href="https://en.wikipedia.org/wiki/Callback_(computer_programming)"&gt;callback function&lt;/a&gt; design pattern. When messages are pulled from the broker, the provided function is called with a list of messages as its argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  print("Creating handler")
        msg_handler = create_handler(producer)

        print("Setting handler")
        consumer.consume(msg_handler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After setting up the callback, we kick off the asyncio event loop.  At this point, the transformer service pauses and waits until messages are available to pull from the broker.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        # Keep your main thread alive so the consumer will keep receiving data
        await asyncio.Event().wait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
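&lt;p&gt;Putting it together, the service’s entry point simply runs &lt;code&gt;main()&lt;/code&gt; on the asyncio event loop. A minimal, runnable sketch of that lifecycle (the timeout is added here only so the example terminates; the real service waits forever):&lt;/p&gt;

```python
import asyncio

async def main():
    # In the real transformer, the connect/consume setup shown above
    # happens here before the service parks on an Event.
    try:
        # The real service awaits asyncio.Event().wait() with no timeout,
        # which blocks forever; a short timeout keeps this sketch finite.
        await asyncio.wait_for(asyncio.Event().wait(), timeout=0.1)
    except asyncio.TimeoutError:
        pass

if __name__ == "__main__":
    asyncio.run(main())
```

&lt;p&gt;Because &lt;code&gt;consume()&lt;/code&gt; registers a callback rather than blocking, the event wait is what keeps the process alive between message deliveries.&lt;/p&gt;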




&lt;h2&gt;
  
  
  Creating the Message Handler Function
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;create_handler&lt;/code&gt; function takes a producer object and returns a callback function. Since the callback’s signature is fixed by the SDK, we use the &lt;a href="https://en.wikipedia.org/wiki/Closure_(computer_programming)"&gt;closure pattern&lt;/a&gt; to implicitly pass the producer to the msg_handler function when we create it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;msg_handler&lt;/code&gt; function is passed three arguments when called: a list of messages, an error (if one occurred), and a context consisting of a dictionary.  Our handler loops over the messages and, for each one, calls the transform function, sends the transformed message to the second station using the producer, and acknowledges that the message has been processed.  In Memphis.dev, messages are not marked off as delivered until the consumer acknowledges them.  This prevents messages from being dropped if an error occurs during processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_handler(producer):
    async def msg_handler(msgs, error, context):
        try:
            for msg in msgs:
                transformed_msg = deserialize_mongodb_cdc_event(msg.get_data())
                await producer.produce(message=transformed_msg)
                await msg.ack()
        except (MemphisError, MemphisConnectError, MemphisHeaderError) as e:
            print(e)
            return

    return msg_handler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
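&lt;p&gt;The closure can be exercised without a live broker by substituting stub producer and message objects. The stubs below are purely illustrative (they are not part of the SDK), and the transform is replaced with an identity function to keep the sketch self-contained:&lt;/p&gt;

```python
import asyncio

# Minimal stand-ins for the SDK's producer and message objects,
# purely for exercising the handler logic locally.
class StubProducer:
    def __init__(self):
        self.sent = []
    async def produce(self, message):
        self.sent.append(message)

class StubMessage:
    def __init__(self, data):
        self._data = data
        self.acked = False
    def get_data(self):
        return self._data
    async def ack(self):
        self.acked = True

def create_handler(producer):
    # Same shape as the handler above; error handling is omitted and the
    # transform is the identity function, for brevity.
    async def msg_handler(msgs, error, context):
        for msg in msgs:
            transformed = msg.get_data()
            await producer.produce(message=transformed)
            await msg.ack()
    return msg_handler

producer = StubProducer()
handler = create_handler(producer)
msg = StubMessage(b'{"payload": {}}')
asyncio.run(handler([msg], None, {}))
```

&lt;p&gt;Note the ordering: the message is produced to the output station before it is acknowledged, so a crash between the two steps causes redelivery rather than data loss.&lt;/p&gt;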






&lt;h2&gt;
  
  
  The Message Transformer Function
&lt;/h2&gt;

&lt;p&gt;Now, we get to the meat of the service: the message transformer function.  Message payloads (returned by the get_data() method) are stored as &lt;a href="https://docs.python.org/3/library/stdtypes.html#bytearray"&gt;bytearray&lt;/a&gt; objects.  We use the Python json library to deserialize the messages into a hierarchy of Python collections (list and dict) and primitive types (int, float, str, and None).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deserialize_mongodb_cdc_event(input_msg):
    obj = json.loads(input_msg)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We expect the object to have a payload property with an object as the value.  That object then has two properties (“before” and “after”) which are either None or strings containing serialized JSON objects.  We use the JSON library again to deserialize and replace the strings with the objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; if "payload" in obj:
        payload = obj["payload"]

        if "before" in payload:
            before_payload = payload["before"]
            if before_payload is not None:
                payload["before"] = json.loads(before_payload)

        if "after" in payload:
            after_payload = payload["after"]
            if after_payload is not None:
                payload["after"] = json.loads(after_payload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, we reserialize the entire JSON record and convert it back into a bytearray for transmission to the broker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  output_s = json.dumps(obj)
    output_msg = bytearray(output_s, "utf-8")
    return output_msg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
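&lt;p&gt;Assembling the three fragments above into one function (condensed slightly into a loop over the two keys), the whole transformation can be exercised end-to-end against an abbreviated sample event. The sample below is shortened for readability; real messages also carry the &lt;code&gt;schema&lt;/code&gt; section:&lt;/p&gt;

```python
import json

def deserialize_mongodb_cdc_event(input_msg):
    # Parse the outer Debezium envelope.
    obj = json.loads(input_msg)

    if "payload" in obj:
        payload = obj["payload"]
        # "before" and "after" arrive as JSON serialized into strings;
        # replace them with the deserialized objects where present.
        for key in ("before", "after"):
            if payload.get(key) is not None:
                payload[key] = json.loads(payload[key])

    # Re-serialize the whole record for transmission to the broker.
    return bytearray(json.dumps(obj), "utf-8")

# Abbreviated sample CDC event: "after" is a string-serialized record.
sample = bytearray(json.dumps({
    "payload": {
        "before": None,
        "after": "{\"description\": \"buy milk\", \"completed\": false}",
    }
}), "utf-8")

transformed = json.loads(deserialize_mongodb_cdc_event(sample))
print(transformed["payload"]["after"])  # now a real JSON object
```

&lt;p&gt;After the transformation, "after" is a nested object rather than an escaped string, which is what allows Schemaverse to validate it against a schema later in the series.&lt;/p&gt;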



&lt;p&gt;Hooray! Our objects now look like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"schema" : ...,

"payload" : {
"before" : null,

"after" : {
"_id": { "$oid": "645fe9eaf4790c34c8fcc2ed" },
"creation_timestamp": { "$date": 1684007402978 },
"due_date": { "$date" : 1684266602978 },
"description": "buy milk",
"completed": false
},

...
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running the Transformer Service
&lt;/h2&gt;

&lt;p&gt;If you followed the 7 steps in the &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;previous blog post&lt;/a&gt;, you only need three additional steps to start the transformer service and verify that it’s working:&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Start the Transformer Service
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d cdc-transformer
[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container cdc-transformer                                  Started                                                            1.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 9: Start the Second Printing Consumer
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d cleaned-printing-consumer
[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container cleaned-printing-consumer                        Started                                                            1.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 10: Check the Memphis UI
&lt;/h2&gt;

&lt;p&gt;When the transformer starts producing messages to Memphis.dev, a second station named "cleaned-todo-cdc-events" will be created.  You should see this new station on the Station Overview page in the Memphis.dev UI like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/9cekwZgzbzuc6Gm_JfFsJZsNS3lZtqEeZHkFoWLRRoU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvNDAz/N2k4ZGN5NDllMnkz/bnNsam8ucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/9cekwZgzbzuc6Gm_JfFsJZsNS3lZtqEeZHkFoWLRRoU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvNDAz/N2k4ZGN5NDllMnkz/bnNsam8ucG5n" alt="Check memphis ui" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The details page for the "cleaned-todo-cdc-events" station should show the transformer attached as a producer, the printing consumer attached as a consumer, and the transformed messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/3QNzQidpI7ds3oBNx1aZLw9YOk4DzemlP-j7J9acaKY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvcjN4/ZGltczM1cTJ0NWdk/ZW9qbzQucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/3QNzQidpI7ds3oBNx1aZLw9YOk4DzemlP-j7J9acaKY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvcjN4/ZGltczM1cTJ0NWdk/ZW9qbzQucG5n" alt="Image description" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations!  We’re now ready to tackle validating messages using Schemaverse in our next blog post. Subscribe to our newsletter to stay tuned! &lt;/p&gt;

&lt;p&gt;Head over to &lt;a href="https://memphis.dev/blog/part-4-validating-cdc-messages-with-schemaverse/"&gt;Part 4: Validating CDC Messages with Schemaverse&lt;/a&gt; to learn further.&lt;/p&gt;




&lt;p&gt;In case you missed parts 1 &amp;amp; 2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, Developer Advocate at &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataprocessing</category>
      <category>mongodb</category>
      <category>cdc</category>
      <category>memphisdev</category>
    </item>
    <item>
      <title>An Introduction to Data Mesh</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Mon, 29 May 2023 05:41:22 +0000</pubDate>
      <link>https://community.ops.io/avital_trifsik/an-introduction-to-data-mesh-3dlk</link>
      <guid>https://community.ops.io/avital_trifsik/an-introduction-to-data-mesh-3dlk</guid>
      <description>&lt;p&gt;As more and more teams have started to look for solutions that can help them unlock the full potential of their systems and people, decentralized architectures have started to become more and more popular. Whether it’s cryptocurrencies, microservices, or Git, decentralization has proven to be an effective method of dealing with centralized bottlenecks. Along the same lines, one approach to decentralizing control of data is using a data mesh. But what really is it, and how can it help? Let’s take a closer look at the concept and go over the data mesh architecture to better understand its benefits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data challenges in enterprises
&lt;/h2&gt;

&lt;p&gt;It’s no secret that organizations have come quite a long way in their data journey. However, they still face a set of challenges that prevents them from leveraging the full benefits of data. These challenges include:  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trustworthiness&lt;/strong&gt;&lt;br&gt;
The traceability, quality, and observability of data demand robust implementation. It’s important to ask yourself a few difficult questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can you trust the data?&lt;/li&gt;
&lt;li&gt;Is your data file complete?&lt;/li&gt;
&lt;li&gt;Do you have the latest file?&lt;/li&gt;
&lt;li&gt;Is your data source correct? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agility&lt;/strong&gt;&lt;br&gt;
Change is the only thing that’s constant, and that’s true for large enterprises, too. It’s very difficult for data estates to keep up with these changes, which gets in the way of enterprise agility. Take report generation, for instance: it can take weeks, which is quite a long time frame in today’s fast-paced world. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt;&lt;br&gt;
Keeping up with data requires specialized skills across the workforce. Maintaining data can become quite expensive, and where skills are lacking, bottlenecks are bound to be frequent.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Productivity&lt;/strong&gt;&lt;br&gt;
Productivity is another data challenge. Both business and data analysts spend 30-40% of their time looking for the correct dataset. Similarly, data engineers spend most of their time figuring out how to create a uniform dataset from disparate sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership&lt;/strong&gt;&lt;br&gt;
Establishing dataset ownership is also a challenge. It’s hard to determine the owner and who can be trusted enough to declare the dataset trustworthy. In most cases, the team that owns the data platform takes ownership of the data, even though it might not understand it.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discoverability&lt;/strong&gt;&lt;br&gt;
Only a few organizations have been able to leverage their data estate and set up a data marketplace where their consumers can explore different datasets and understand the ones they wish to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is data mesh?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/RUR3EqOVHL5TuKgf-MusAS1Uh3sGjUKx3xgWZENEoIo/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdjA3/czZsaXRiYmt4ZW43/dmI5cWkuanBlZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/RUR3EqOVHL5TuKgf-MusAS1Uh3sGjUKx3xgWZENEoIo/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdjA3/czZsaXRiYmt4ZW43/dmI5cWkuanBlZw" alt="What is data mesh" width="800" height="683"&gt;&lt;/a&gt;&lt;br&gt;
An overview of a data mesh (&lt;a href="https://www.montecarlodata.com/blog-what-is-a-data-mesh-and-how-not-to-mesh-it-up/"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;A data mesh can be best understood as a practice or concept used to manage a large amount of data spread across a decentralized or distributed network. It can also refer to a platform responsible for this function, or even both. As companies become increasingly dependent on their ability to store large volumes of data, distribute it through data pipelines, and leverage it, it’s important to create an effective schema for using that data. This is where a data mesh comes in. &lt;/p&gt;

&lt;p&gt;The idea behind a data mesh is that introducing more technology won’t solve the data challenges that companies face today. Instead, the only way to face those challenges is to reorganize the tools, processes, and people involved. A data mesh essentially creates a replicable method of managing different data sources across the company’s ecosystem and makes them more discoverable. At the same time, it gives consumers faster, more secure, and more efficient access to data. &lt;/p&gt;

&lt;p&gt;A data mesh offers numerous benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It allows for decentralized data operations, which improves business agility, scalability, and time-to-market.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Organizations that adopt the data mesh architecture avoid being locked into one data product or platform. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It adopts a self-service model that ensures easy access to a centralized infrastructure, allowing for faster SQL queries and data access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since it decentralizes data ownership, it ensures transparency across teams. (In comparison, centralized data ownership makes every team heavily dependent on the central data team.)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data mesh architecture components
&lt;/h2&gt;

&lt;p&gt;The data mesh architecture involves four main components. Let’s go over them one by one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/rNBgTyVQeC1swCOJBKfhhUXmLWKdk80MZceXB9ABtqs/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMHJq/aG02MnB1cTBibTBt/NTZrOTEuanBlZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/rNBgTyVQeC1swCOJBKfhhUXmLWKdk80MZceXB9ABtqs/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMHJq/aG02MnB1cTBibTBt/NTZrOTEuanBlZw" alt="data mesh principals" width="800" height="481"&gt;&lt;/a&gt;&lt;br&gt;
4 data mesh principles (&lt;a href="https://www.datanami.com/2022/01/21/data-meshes-set-to-spread-in-2022/"&gt;Source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decentralized data ownership&lt;/strong&gt;&lt;br&gt;
This architecture component mainly revolves around the people involved and calls for remodeling the monolithic data architecture by decentralizing analytical data and realigning its ownership from a central team to domain teams. &lt;/p&gt;

&lt;p&gt;In a data mesh, a domain team that’s extremely familiar with the data asset is responsible for curating it, ensuring high-quality data administration and governance. In contrast, in a data warehouse antipattern, a generalist team is responsible for managing all the data of the organization and is usually focused on the technical aspect of the data warehouse instead of the quality of the data.&lt;/p&gt;

&lt;p&gt;So organizations implementing a data mesh must define which data set is owned by which domain team. In addition to that, all the teams should be quick to make changes to maintain their mesh’s data quality. By making domain-centric accountability possible, decentralized data ownership solves many problems related to agility, ownership, and productivity. &lt;/p&gt;

&lt;p&gt;For instance, organizations take a while to respond to the market since changes have to be made to many IT systems to make any business change. This is why unaligned priorities and poor coordination across the team hinder enterprise agility. Considering the rapid growth in data sources and the proliferating business use cases, central teams have become nothing more than bottlenecks. However, going from a monolithic architecture to domain-driven microservices has made operational systems more agile. And a data mesh can do the same for analytical data.     &lt;/p&gt;

&lt;p&gt;Data consumers usually spend their time finding the data owner, determining its traceability, and interpreting its meaning. As a result, the overall productivity of teams is reduced. However, decentralization brings both the analytical and operational world closer and establishes traceability, ownership, and a clear interpretation, thus improving the teams’ turnaround times.&lt;/p&gt;

&lt;p&gt;And finally, ownership; in most cases, data owners aren’t known, making the IT teams responsible for the ETL the owners of the data. Central IT teams often act as intermediaries – they pass consumer requests to producers and aren’t considered owners because they don’t produce the data and neither do they understand it. Realigning the ownership of analytical data to the right domains can solve the problem since these domains are the producers of data and can understand it, too. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data as a product&lt;/strong&gt;&lt;br&gt;
With domains identified and ownership established, the next step is to stop thinking of analytical data as an asset that must be stored and instead think of it as a product that must be served. Teams responsible for a data mesh publish data so that other teams, i.e., their internal customers, can benefit from it. &lt;/p&gt;

&lt;p&gt;This is why domains need to stop considering analytical data as a by-product of business operations and instead treat it as a first-class product, complete with dedicated owners responsible for its usability, discoverability, uptime, and quality, just like any other business service. As such, they should also apply the different aspects of product development to make it customer-focused, reliable, useful, and valuable. You can think of the data products published by the teams in a data mesh as microservices; the only difference is that data is on offer.     &lt;/p&gt;

&lt;p&gt;Thinking of data as a product solves problems related to productivity, agility, discoverability, and trustworthiness. The productivity of a data consumer automatically increases as trustworthiness, discoverability, and agility come into the equation. Let’s see how.&lt;/p&gt;

&lt;p&gt;A data product is essentially an autonomous unit with its own release cycles and feature roadmap. This means that data teams don’t need to wait for a central team to provide some environment or data so that they can start working. In turn, establishing traceability and authenticity hardly takes time. Similarly, rework to align the SLOs (service level objectives) of the input dataset with that of the use-case takes relatively less time.&lt;/p&gt;

&lt;p&gt;And with data ownership assigned to domains, the product owner (of the data) is responsible for the data product. This means that the product owner should make sure that the data product’s security, traceability, and quality are maintained and also reported via SLOs and the right metrics. &lt;/p&gt;

&lt;p&gt;And finally, by thinking of data as a product, each product is self-explanatory and is advertised and cataloged on the organization’s data marketplaces. The relevant documentation outlines different usability topics and explains the relationship with other SLOs and data products. As a result, consumers get to enjoy full visibility of the data product, which, in turn, allows them to make a well-informed decision about its use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-serve platform&lt;/strong&gt;&lt;br&gt;
Even though thinking of data as a product has numerous benefits, it might end up increasing the overall operation cost because it involves many small but highly skilled teams and numerous independent infrastructures. Plus, if these highly skilled teams aren’t properly optimized, the operating cost will go up further. This is where the third component of the data mesh architecture comes into play – a self-serve platform.&lt;/p&gt;

&lt;p&gt;Although a data mesh revolves around the idea of decentralized data management, one of its most important aspects is a centralized location or a central data infrastructure that can facilitate the data product lifecycle, where all the members of the organization can easily find the datasets they require. This central infrastructure should support tenancy so that it facilitates autonomy. It should also be self-serve, and provide multiple out-of-the-box tools.  &lt;/p&gt;

&lt;p&gt;Historical as well as real-time data should be available, and there should be some automated way of accessing it. While there are no plug-and-play tools that fulfill this principle, it can be accomplished via a wiki, a UI, or an API. &lt;/p&gt;

&lt;p&gt;The important thing is that self-serve tools should be thoughtfully built and must reduce the cognitive load on the data product teams. They should also bring abstraction over the lower-level technical components to allow for data product standardization and faster development. Another important part of self-service is data product management, which includes adding, updating, and removing data products. Management and entry should be as straightforward as possible to encourage usage.&lt;/p&gt;

&lt;p&gt;Just like the other components, the self-serve platform solves a number of problems related to skills, cost of ownership, and agility. Since a self-serve platform takes away technical complexity, there’s less need for specialists, and generalists are enough to serve the purpose. As a result, there’s no need to invest in a highly skilled team. The cost of ownership is also reduced in terms of infrastructure, since it’s centrally provisioned. And finally, autonomous data product teams can directly use the self-service platform; they don’t need to rely on the central infrastructure team to provide them with infrastructure resources and data. This speeds up the development cycle.     &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Federated computational governance&lt;/strong&gt;&lt;br&gt;
The three data mesh architecture principles discussed above solve most of the data challenges faced by organizations. However, since most data products operate across different domains, how can you harmonize data? The answer lies in the last architecture component: federated computational governance, which is a big change from how traditional central governance is enforced. It changes the way teams are organized and the way the infrastructure supports governance. In federated governance, a data product owner manages aspects such as local access policies, data modeling, and data quality. This is a big shift from implementing canonical data models to smaller models specifically built to meet the needs of the data product.&lt;/p&gt;

&lt;p&gt;Governance should be divided into two levels: local and global. Local governance is scoped to the data product: it defines the local processes, frameworks, and governance policies, and is responsible for their implementation and adherence. This is a step away from central governing bodies that created policies and were responsible for validation and adherence. &lt;/p&gt;

&lt;p&gt;Meanwhile, global governance involves a cross-functional body with experts in different specializations such as technology, legal, security, and infrastructure and is responsible for formulating policies. The local governing body is responsible for implementation as well as constant adherence.  &lt;/p&gt;

&lt;p&gt;To sum up, with federated governance applied to your data mesh, teams can always use data available to them from different domains. &lt;/p&gt;

&lt;p&gt;All four principles are important for implementing a data mesh in an organization. Of course, the degree of implementation can differ, but each principle has its own benefits and offsets the drawbacks of the others. Just keep in mind that the bigger the mesh, the more value you can generate from the data. &lt;/p&gt;




&lt;p&gt;&lt;a href="https://memphis.dev/newsletter"&gt;Join 4500+ others and sign up for our data engineering newsletter&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/"&gt;memphis.dev&lt;/a&gt; by &lt;a href="https://twitter.com/memphisveta"&gt;Sveta Gimpelson&lt;/a&gt; Co-founder &amp;amp; VP of Data &amp;amp; Research at Memphis.dev.&lt;/p&gt;

</description>
      <category>datamesh</category>
      <category>data</category>
    </item>
    <item>
      <title>Part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Sun, 28 May 2023 09:42:14 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphisdev-5gon</link>
      <guid>https://community.ops.io/memphis_dev/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphisdev-5gon</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part two of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;last blog post&lt;/a&gt;, we introduced a reference implementation for capturing change data capture (CDC) events from a PostgreSQL database using Debezium Server and Memphis.dev. By replacing Apache Kafka with Memphis.dev, the solution substantially reduced the operational resources and overhead – saving money and freeing developers to focus on building new functionality.&lt;/p&gt;

&lt;p&gt;PostgreSQL is not the only commonly used database, however. Debezium provides connectors for a range of databases, including the non-relational document database MongoDB. MongoDB is popular with developers, especially those working in dynamic programming languages, since it avoids the object-relational impedance mismatch: developers can directly store, query, and update objects in the database.&lt;/p&gt;

&lt;p&gt;In this blog post, we demonstrate how to adapt the CDC solution to MongoDB.&lt;/p&gt;




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;Here, we describe the architecture of the reference solution for delivering change data capture events with &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt;. The architecture has not changed from &lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;our previous blog&lt;/a&gt; post except for the replacement of PostgreSQL with MongoDB.&lt;/p&gt;

&lt;p&gt;A Todo Item generator script writes randomly-generated records to MongoDB. &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; receives CDC events from MongoDB and forwards them to the &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;Memphis REST gateway&lt;/a&gt; through the HTTP client sink. The Memphis REST gateway adds the messages to a station in Memphis.dev. Lastly, a consumer script polls Memphis.dev for new messages and prints them to the console.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the MongoDB collection every 0.5 seconds. Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;: Configured with a single database containing a single collection (todo_items).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with MongoDB source and HTTP Client sink connectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/bnjb2imYv56Xp_vm6aVdOL09ExPx5kEG7Qd-Kj8pmnI/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzLzVxN2wy/YzlucGhvYW5xZDk5/MHdsLmpwZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/bnjb2imYv56Xp_vm6aVdOL09ExPx5kEG7Qd-Kj8pmnI/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzLzVxN2wy/YzlucGhvYW5xZDk5/MHdsLmpwZw" alt="Image description" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The implementation tutorial is available in the mongodb-debezium-cdc-example directory of the &lt;a href="https://github.com/memphisdev/memphis-example-solutions"&gt;Memphis Example Solutions&lt;/a&gt; repository. &lt;a href="https://docs.docker.com/compose/"&gt;Docker Compose&lt;/a&gt; will be needed to run it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the Implementation&lt;/strong&gt;&lt;br&gt;
Build the Docker images for Debezium Server, the printing consumer, and database setup (collection and user creation).&lt;/p&gt;

&lt;p&gt;Currently, the implementation depends on a pre-release version of &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; for the JWT authentication support. A Docker image will be built directly from the main branch of the Debezium and Debezium Server repositories. Note that this step can take quite a while (~20 minutes) to run. When Debezium Server 2.3.0 is released, we will switch to using the upstream Docker image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Build the Images&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose build --pull --no-cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Start the Memphis.dev Broker and REST Gateway&lt;/strong&gt;&lt;br&gt;
Start the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt; and &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;REST gateway&lt;/a&gt;. Note that the memphis-rest-gateway service depends on the memphis broker service, so the broker service will be started as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d memphis-rest-gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 4/4
 ⠿ Network mongodb-debezium-cdc-example_default                   Created                                                        0.0s
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1      Healthy                                                        6.0s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1               Healthy                                                       16.8s
 ⠿ Container mongodb-debezium-cdc-example-memphis-rest-gateway-1  Started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Create a Station and Corresponding User in Memphis.dev&lt;/strong&gt;&lt;br&gt;
Messages are delivered to “stations” in Memphis.dev; these are equivalent to the “topics” used by other message brokers. Point your browser at &lt;a href="http://localhost:9000/"&gt;http://localhost:9000/&lt;/a&gt;. Click the “sign in with root” link at the bottom of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/MndvPXug35D9mBqE5XeHZG2Q9UCWTR6qb8f5X8gDMQc/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL2U2Y3Nq/Y3JhZ3lncDVsbjVv/NDgwLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/MndvPXug35D9mBqE5XeHZG2Q9UCWTR6qb8f5X8gDMQc/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL2U2Y3Nq/Y3JhZ3lncDVsbjVv/NDgwLnBuZw" alt="Image description" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with root (username) and memphis (password).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/3QytH7KP0otqsy2KCJvgOCVqw13mJ5KLXNePpyXOH-U/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL3oxZ3N0/d243Zml0eWhmdjMw/c3F5LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/3QytH7KP0otqsy2KCJvgOCVqw13mJ5KLXNePpyXOH-U/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL3oxZ3N0/d243Zml0eWhmdjMw/c3F5LnBuZw" alt="Image description" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the wizard to create a station named todo-cdc-events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/YDYGJUfrdmLIwAdNTMJEAvxpP2JELA4wc7isXM3QcvY/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL2s5YjV5/Zm5sd284M21teXFj/bDNwLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/YDYGJUfrdmLIwAdNTMJEAvxpP2JELA4wc7isXM3QcvY/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL2s5YjV5/Zm5sd284M21teXFj/bDNwLnBuZw" alt="Image description" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a user named todocdcservice with the same value for the password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/_okHUwuF_6AS8snutH5DzkasJZEhZseOfnnRowThclk/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzLzYybTQ4/am9jNWxtdXkxdmwy/dWlrLnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/_okHUwuF_6AS8snutH5DzkasJZEhZseOfnnRowThclk/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzLzYybTQ4/am9jNWxtdXkxdmwy/dWlrLnBuZw" alt="Image description" width="768" height="833"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “next” until the wizard is finished:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/tWI9iE9OHjmddyzoUjwxpqgDn8DvGFxkfFiZB0Re6k0/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzLzJrZ2Y0/anAweTZiMGwyeGJm/a3A3LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/tWI9iE9OHjmddyzoUjwxpqgDn8DvGFxkfFiZB0Re6k0/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzLzJrZ2Y0/anAweTZiMGwyeGJm/a3A3LnBuZw" alt="Image description" width="768" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “Go to station overview” to go to the station overview page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/8KRljnuNeJqQAhAGgwx5dxT2tjx6iQdp6PZU9axVxBo/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL2RsbXN5/dTF0YXNxdnNtbXRk/cWh1LnBuZw" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/8KRljnuNeJqQAhAGgwx5dxT2tjx6iQdp6PZU9axVxBo/w:800/mb:500000/ar:1/aHR0cHM6Ly9jb21t/dW5pdHkub3BzLmlv/L3JlbW90ZWltYWdl/cy91cGxvYWRzL2Fy/dGljbGVzL2RsbXN5/dTF0YXNxdnNtbXRk/cWh1LnBuZw" alt="Image description" width="768" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Start the Printing Consumer&lt;/strong&gt;&lt;br&gt;
We used the &lt;a href="https://github.com/memphisdev/memphis.py"&gt;Memphis.dev Python SDK&lt;/a&gt; to create a &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/printing-consumer/test_consumer.py"&gt;consumer script&lt;/a&gt; that polls the todo-cdc-events station and prints the messages to the console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 3/3
 ⠿ Container mongodb-debezium-cdc-example-memphis-metadata-1  Hea...                                                             0.5s
 ⠿ Container mongodb-debezium-cdc-example-memphis-1           Healthy                                                            1.0s
 ⠿ Container printing-consumer                                Started                                                            1.4s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
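&lt;p&gt;The core of the printing consumer can be sketched as follows. This is a simplified sketch based on the Memphis.dev Python SDK; the connection parameters mirror this tutorial’s setup, and the full script is in the example repository:&lt;/p&gt;

```python
import asyncio

async def handle_messages(msgs, error, context):
    """Print each consumed message, as the printing consumer does."""
    if error:
        print("consume error:", error)
        return
    for msg in msgs:
        print("message:", msg.get_data())
        await msg.ack()

async def main():
    # Assumed wiring; see the SDK docs for all connect/consumer options.
    from memphis import Memphis  # pip install memphis-py
    memphis = Memphis()
    await memphis.connect(host="localhost", username="todocdcservice",
                          password="todocdcservice")
    consumer = await memphis.consumer(station_name="todo-cdc-events",
                                      consumer_name="printing-consumer")
    consumer.consume(handle_messages)
    await asyncio.Event().wait()  # keep the script alive while polling
```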



&lt;p&gt;&lt;strong&gt;Step 5: Start and Configure MongoDB&lt;/strong&gt;&lt;br&gt;
To capture changes, &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt;’s &lt;a href="https://www.mongodb.com/docs/manual/replication/"&gt;replication&lt;/a&gt; functionality must be enabled. There are several steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The replica set name must be set. This can be done by &lt;a href="https://www.mongodb.com/docs/manual/tutorial/deploy-replica-set-for-testing/#std-label-server-replica-set-deploy-test"&gt;passing the name of a replica set&lt;/a&gt; on the command-line or in the configuration file. In the &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/docker-compose.yaml#L10"&gt;Docker Compose file&lt;/a&gt;, we run MongoDB with the command-line argument –replSet rs0 to set the replica set name.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When replication is used and authorization is enabled, a common key file must be provided to each replica instance. We generated a key file following the &lt;a href="https://www.mongodb.com/docs/manual/tutorial/enforce-keyfile-access-control-in-existing-replica-set/"&gt;instructions&lt;/a&gt; in the MongoDB documentation. We then &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/mongodb/Dockerfile"&gt;built an image&lt;/a&gt; that extends the official MongoDB image by including the key file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The replica set needs to be initialized once MongoDB is running. We &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/database-setup/database_setup.py"&gt;use a script&lt;/a&gt; that configures the instance on startup. The script calls the &lt;a href="https://www.mongodb.com/docs/manual/reference/command/replSetInitiate/"&gt;replSetInitiate&lt;/a&gt; command with a list of the IP addresses and ports of each MongoDB instance in the replica set. This command causes the MongoDB instances to communicate with each other and select a leader.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
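&lt;p&gt;The initialization step boils down to building a config document and passing it to replSetInitiate. The helper below is hypothetical, and the connection string in the comments is an assumption based on this example’s Compose file:&lt;/p&gt;

```python
def replica_set_config(name, hosts):
    """Build the config document passed to the replSetInitiate command."""
    return {
        "_id": name,
        "members": [{"_id": i, "host": h} for i, h in enumerate(hosts)],
    }

# Applied with pymongo, roughly:
# from pymongo import MongoClient
# client = MongoClient("mongodb://root:mongodb@db:27017/?directConnection=true")
# client.admin.command("replSetInitiate", replica_set_config("rs0", ["db:27017"]))
```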

&lt;p&gt;Generally speaking, replica sets are used for increased reliability (high availability). Most documentation that you’ll find describes how to set up a replica set with multiple MongoDB instances. In our case, Debezium’s MongoDB connector piggybacks off of the replication functionality to capture data change events. Although we go through the steps to configure a replica set, we only use one MongoDB instance.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/mongodb-debezium-cdc-example/todo-generator/todo_generator.py"&gt;todo item generator script&lt;/a&gt; creates a new todo item every half second. The field values are randomly generated. The items are added to a MongoDB collection named “todo_items.”&lt;/p&gt;
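&lt;p&gt;A minimal sketch of the generator’s core logic, with field names taken from the description above (the pymongo wiring in the comments is an assumption; see the linked script for the real implementation):&lt;/p&gt;

```python
import random
import string
from datetime import datetime, timedelta

def make_todo_item():
    """Build one randomly generated todo item with the fields described above."""
    now = datetime.utcnow()
    item = {
        "description": "".join(random.choices(string.ascii_uppercase, k=20)),
        "creation_timestamp": now,
        "completed": False,
    }
    if random.random() > 0.5:  # the due date is optional, so include it half the time
        item["due_date"] = now + timedelta(days=3)
    return item

# Against the Compose stack, the loop would look something like:
# import time
# from pymongo import MongoClient
# collection = MongoClient("mongodb://root:mongodb@db")["todo_application"]["todo_items"]
# while True:
#     collection.insert_one(make_todo_item())
#     time.sleep(0.5)
```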

&lt;p&gt;In the Docker Compose file, the todo item generator script is configured to depend on the MongoDB instance being in a healthy state and on successful completion of the database setup script. Starting the todo item generator therefore also starts MongoDB and runs the database setup script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d todo-generator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 3/3
 ⠿ Container mongodb                 Healthy                                                                                     8.4s
 ⠿ Container mongodb-database-setup  Exited                                                                                      8.8s
 ⠿ Container mongodb-todo-generator  Started                                                                                     9.1s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 6: Start the Debezium Server&lt;/strong&gt;&lt;br&gt;
The last service that needs to be started is the &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt;. The server is configured with a source connector for MongoDB and the HTTP Client sink connector through a &lt;a href="https://en.wikipedia.org/wiki/.properties"&gt;Java properties file&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;debezium.sink.type=http
debezium.sink.http.url=http://memphis-rest-gateway:4444/stations/todo-cdc-events/produce/single
debezium.sink.http.time-out.ms=500
debezium.sink.http.retries=3
debezium.sink.http.authentication.type=jwt
debezium.sink.http.authentication.jwt.username=todocdcservice
debezium.sink.http.authentication.jwt.password=todocdcservice
debezium.sink.http.authentication.jwt.url=http://memphis-rest-gateway:4444/
debezium.source.connector.class=io.debezium.connector.mongodb.MongoDbConnector
debezium.source.mongodb.connection.string=mongodb://db
debezium.source.mongodb.user=root
debezium.source.mongodb.password=mongodb
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.topic.prefix=tutorial
debezium.format.key=json
debezium.format.value=json
quarkus.log.console.json=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of the options are self-explanatory. The HTTP client sink URL is worth explaining in detail. &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;Memphis.dev REST gateway&lt;/a&gt; expects to receive POST requests with a path in the following format:&lt;br&gt;
&lt;em&gt;/stations/{station}/produce/{quantity}&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The {station} placeholder is replaced with the name of the station to send the message to. The {quantity} placeholder is replaced with the value single (for a single message) or batch (for multiple messages).&lt;/p&gt;

&lt;p&gt;The message payload is passed in the body of the POST request. The REST gateway supports three message formats: plain text, JSON, and Protocol Buffers. The value of the Content-Type header (text/plain, application/json, or application/x-protobuf) determines how the payload is interpreted.&lt;/p&gt;

&lt;p&gt;The Debezium Server’s HTTP Client sink produces requests that are consistent with these patterns: requests use the POST verb, each request contains a single JSON-encoded message as the payload, and the Content-Type header is set to application/json. We use todo-cdc-events as the station name and single as the quantity value in the endpoint URL to route messages and indicate how the REST gateway should interpret the requests:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://memphis-rest-gateway:4444/stations/todo-cdc-events/produce/single"&gt;http://memphis-rest-gateway:4444/stations/todo-cdc-events/produce/single&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The debezium.sink.http.authentication.type=jwt property indicates that the HTTP Client sink should use JWT authentication. The username and password properties are self-evident, but the debezium.sink.http.authentication.jwt.url property deserves some explanation. An initial token is acquired using the /auth/authenticate endpoint, while the authentication is refreshed using the separate /auth/refreshToken endpoint. The JWT authentication in the HTTP Client appends the appropriate endpoint to the given base URL.&lt;/p&gt;

&lt;p&gt;Debezium Server can be started with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d debezium-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 7: Confirm the System is Working&lt;/strong&gt;&lt;br&gt;
Check the todo-cdc-events station overview screen in the Memphis.dev web UI to confirm that the producer and consumer are connected and messages are being delivered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/W1xjlIeVCxRyBSGoQE_FxHDxrdYB4xp0YAYi3BXpGmE/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZjRj/d3RrOGRuMzZhaHdu/ZTJpMHUucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/W1xjlIeVCxRyBSGoQE_FxHDxrdYB4xp0YAYi3BXpGmE/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZjRj/d3RrOGRuMzZhaHdu/ZTJpMHUucG5n" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;And, print the logs for the printing-consumer container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose logs printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each consumed message is printed to the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bytearray(b'{"schema":{"type":"struct","fields":[{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"before"},{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"after"},{"type":"struct","fields":[{"type":"array","items":{"type":"string","optional":false},"optional":true,"field":"removedFields"},{"type":"string","optional":true,"name":"io.debezium.data.Json","version":1,"field":"updatedFields"},{"type":"array","items":{"type":"struct","fields":[{"type":"string","optional":false,"field":"field"},{"type":"int32","optional":false,"field":"size"}],"optional":false,"name":"io.debezium.connector.mongodb.changestream.truncatedarray","version":1},"optional":true,"field":"truncatedArrays"}],"optional":true,"name":"io.debezium.connector.mongodb.changestream.updatedescription","version":1,"field":"updateDescription"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"rs"},{"type":"string","optional":false,"field":"collection"},{"type":"int32","optional":false,"field":"ord"},{"type":"string","optional":true,"field":"lsid"},{"type":"int64","optional":true,"field":"txnNumber"},{"type":"int64","optional":true,"field":"wallTime"}],"optional":false,"name":"io.debezium.connector.mongo.Source","field":"source"},{"type":"string","optional":true,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"t
ype":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"tutorial.todo_application.todo_items.Envelope"},"payload":{"before":null,"after":"{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ec\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402475},\\"due_date\\": {\\"$date\\": 1684266602475},\\"description\\": \\"GMZVMKXVKOWIOEAVRYWR\\",\\"completed\\": false}","updateDescription":null,"source":{"version":"2.3.0-SNAPSHOT","connector":"mongodb","name":"tutorial","ts_ms":1684007402000,"snapshot":"false","db":"todo_application","sequence":null,"rs":"rs0","collection":"todo_items","ord":1,"lsid":null,"txnNumber":null,"wallTime":1684007402476},"op":"c","ts_ms":1684007402478,"transaction":null}}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Format of the CDC Messages
&lt;/h2&gt;

&lt;p&gt;The incoming messages are formatted as JSON. The messages have two top-level fields (schema and payload). The schema describes the record schema (field names and types), while the payload describes the change to the record. The payload object itself contains two fields (before and after) indicating the value of the record before and after the change.&lt;/p&gt;

&lt;p&gt;For MongoDB, Debezium Server encodes the record as a string of serialized JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"before" : null,

"after" : "{\\"_id\\": {\\"$oid\\": \\"645fe9eaf4790c34c8fcc2ed\\"},\\"creation_timestamp\\": {\\"$date\\": 1684007402978},\\"due_date\\": {\\"$date\\": 1684266602978},\\"description\\": \\"buy milk\\",\\"completed\\": false}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has implications for the downstream processing of messages, which we will describe in a future blog post in this series.&lt;/p&gt;
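&lt;p&gt;In practice, this double encoding means consumers must parse twice: once for the envelope and once for the embedded record string. For example:&lt;/p&gt;

```python
import json

# The "after" field is itself a string of serialized JSON, so it needs a second parse.
event = {
    "before": None,
    "after": '{"_id": {"$oid": "645fe9eaf4790c34c8fcc2ed"},'
             '"description": "buy milk","completed": false}',
}
record = json.loads(event["after"])
print(record["description"])  # buy milk
```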




&lt;p&gt;Congratulations! You now have a working example of how to capture data change events from a MongoDB database using Debezium Server and transfer the events to Memphis.dev for downstream processing.&lt;/p&gt;

&lt;p&gt;Head over to &lt;a href="https://memphis.dev/blog/part-3-transforming-mongodb-cdc-event-messages/"&gt;Part 3: Transforming MongoDB CDC Event Messages&lt;/a&gt; to learn further.&lt;/p&gt;




&lt;p&gt;In case you missed part 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>cdc</category>
      <category>memphisdev</category>
    </item>
    <item>
      <title>How to reduce your data traffic by 30% instantly</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Wed, 24 May 2023 14:52:11 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/how-to-reduce-your-data-traffic-by-30-instantly-547b</link>
      <guid>https://community.ops.io/memphis_dev/how-to-reduce-your-data-traffic-by-30-instantly-547b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The bigger the traffic, the bigger the latency and the higher the cost.&lt;br&gt;
It seems that the global economic situation has grounded us a bit and made us go back to basics, to when every byte of memory counted and every unnecessary line of code was removed. Besides higher latency, which usually drives adding more compute and more cost, we tend to forget we're paying a &lt;strong&gt;huge amount&lt;/strong&gt; of money for the volume of transferred data as well, especially for communication between services.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Let’s take a look at the following scenario&lt;/strong&gt; - &lt;br&gt;
A single, shared EC2 instance with 2 CPUs and 4GB RAM,&lt;br&gt;
processing 10,000 requests of 1MB each per day -&amp;gt; 300,000 requests in an average month -&amp;gt; 292 GB of transferred data.&lt;/p&gt;

&lt;p&gt;If we run those minor numbers with an AWS EC2 calculator, we would get the following invoice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute monthly cost: $33.29&lt;/li&gt;
&lt;li&gt;Data transfer cost:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;If that 292 GBs is transferred within a region, it will cost &lt;strong&gt;$5.84 (14% of the total invoice)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If that 292 GBs is transferred back to the internet, it will cost &lt;strong&gt;$26.28 (44% of the total invoice)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
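&lt;p&gt;As a quick sanity check, the monthly volume can be reproduced in a few lines (note that the 292 GB total implies roughly 1 MB per request):&lt;/p&gt;

```python
# Reproduce the monthly transfer volume from the scenario above,
# assuming roughly 1 MB per request.
GB = 1024 ** 3
monthly_requests = 10_000 * 30            # about 300,000 requests per month
transferred_bytes = monthly_requests * 1024 ** 2
print(transferred_bytes / GB)             # about 292.97 GB
```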


&lt;h2&gt;
  
  
  Popular formats
&lt;/h2&gt;

&lt;p&gt;Usually, services communicate with each other using one of the following formats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- JSON.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;JSON stands for JavaScript Object Notation and is a text format for storing and transporting data.&lt;/p&gt;

&lt;p&gt;When using JSON to send a message, we’ll first have to serialize the object representing the message into a JSON-formatted string, then transmit.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{"sensorId": 32,"sensorValue": 24.5}&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
This string is 36 characters long, but the information content of the string is only 6 characters long. This means that about 16% of transmitted data is actual data, while the rest is metadata. The ratio of useful data in the whole message is increased by decreasing key length or increasing value size, for example, when using a string or array.&lt;/p&gt;
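&lt;p&gt;The overhead is easy to measure directly; the following snippet reproduces the example message and its roughly 16% useful-data ratio:&lt;/p&gt;

```python
import json

# Reproduce the example message and measure how much of it is actual data.
msg = {"sensorId": 32, "sensorValue": 24.5}
encoded = json.dumps(msg, separators=(",", ": "))  # matches the spacing shown above
payload_chars = len("32") + len("24.5")            # 6 characters of real data
print(len(encoded), payload_chars / len(encoded))  # 36 characters, about 16% useful
```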

&lt;p&gt;&lt;strong&gt;- Protobuf.&lt;/strong&gt;&lt;br&gt;
Protocol Buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.&lt;/p&gt;

&lt;p&gt;Protocol Buffers use a binary format to transfer messages.&lt;br&gt;
Using Protocol Buffers in your code is slightly more complicated than using JSON.&lt;/p&gt;

&lt;p&gt;The user must first define a message using the .proto file. This file is then compiled using Google’s protoc compiler, which generates source files that contain the Protocol Buffer implementation for the defined messages.&lt;/p&gt;

&lt;p&gt;This is how our message would look in the .proto definition file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message TemperatureSensorReading {
    optional uint32 sensor_id = 1;
    optional float sensor_value = 2;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When serializing the message from the example above, it’s only 7 bytes long. This can be confusing initially because we would expect uint32 and float to be 8 bytes long when combined. However, Protocol Buffers won’t use all 4 bytes for uint32 if they can encode the data in fewer bytes. In this example, the sensor_id value can be stored in 1 byte. It means that in this serialized message, 1 byte is metadata for the first field, and the field data itself is only 1 byte long. The remaining 5 bytes are metadata and data for the second field; 1 byte for metadata and 4 bytes for data because float always uses 4 bytes in Protocol Buffers. This gives us 5 bytes or 71% of actual data in a 7-byte message.&lt;/p&gt;
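
&lt;p&gt;The reason small integers shrink is the varint encoding. Here is a simplified Python sketch of how a protobuf-style varint packs 7 bits of value per byte. The generated code does this for you; this sketch is for illustration only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def encode_varint(value):
    """Encode a non-negative integer as a protobuf-style varint.

    Each byte carries 7 bits of the value; a set continuation bit
    (added as 128) signals that more bytes follow.
    """
    out = bytearray()
    while True:
        value, low = divmod(value, 128)
        if value:
            out.append(low + 128)   # more bytes follow
        else:
            out.append(low)
            return bytes(out)

print(len(encode_varint(32)))    # 1 -- sensor_id = 32 fits in a single byte
print(len(encode_varint(300)))   # 2 -- larger values need more bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Together with the one-byte field tag, that gives the 2 bytes for the sensor_id field described above.&lt;/p&gt;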

&lt;p&gt;The main difference between the two is that JSON is just text, while Protocol Buffers are binary. This difference has a crucial impact on the size and performance of moving data between different devices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark
&lt;/h2&gt;

&lt;p&gt;In this benchmark, we will take the same message structure and examine the size difference, as well as the network performance.&lt;/p&gt;

&lt;p&gt;We used Memphis Schemaverse as our destination and a simple Python app as the sender.&lt;/p&gt;

&lt;p&gt;Gaming industry use cases typically involve large payloads, which makes them a good way to demonstrate the savings between the different formats. We used one of Blizzard’s example schemas.&lt;br&gt;
The full .proto file we used can be found &lt;a href="https://github.com/Blizzard/s2client-proto/blob/01ab351e21c786648e4c6693d4aad023a176d45c/s2clientprotocol/sc2api.proto#L359"&gt;here&lt;/a&gt;.&lt;br&gt;
Each packet weighs &lt;strong&gt;959.55KiB&lt;/strong&gt; on average.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/zp-nKtqzp8iYYGIL3vDWv07pCzfesATlrLiWLckaHp0/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvanow/eWo1Mmx0cXFhbXdz/eHJlOGwucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/zp-nKtqzp8iYYGIL3vDWv07pCzfesATlrLiWLckaHp0/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvanow/eWo1Mmx0cXFhbXdz/eHJlOGwucG5n" alt="Image description" width="512" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen, the average savings range from 618.19% to 807.93%!&lt;/p&gt;

&lt;p&gt;Another key aspect to take into consideration is the additional step of &lt;strong&gt;Serialization/Deserialization&lt;/strong&gt; and its potential impact on performance, as it is, in effect, an action similar to compression.&lt;/p&gt;

&lt;p&gt;Quick side-note. Memphis Schemaverse eliminates the need to implement Serialization/Deserialization functions, but behind the scenes, a set of conversion functions will take place.&lt;/p&gt;

&lt;p&gt;Going back to serialization. The tests were performed on a MacBook Pro (M1 Pro, 16 GB RAM) using Google.Protobuf.JsonFormatter and Google.Protobuf.JsonParser; the Python serialization/deserialization used google.protobuf.json_format.&lt;/p&gt;
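
&lt;p&gt;If you want a feel for the serialization cost on your own machine, here is a quick sketch with Python’s standard &lt;code&gt;timeit&lt;/code&gt; module. This is not the benchmark setup used above, just a starting point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import timeit

reading = {"sensorId": 32, "sensorValue": 24.5}
text = '{"sensorId": 32,"sensorValue": 24.5}'

# Time 100,000 serializations and deserializations
encode_s = timeit.timeit(lambda: json.dumps(reading), number=100_000)
decode_s = timeit.timeit(lambda: json.loads(text), number=100_000)

print("encode: {:.3f}s  decode: {:.3f}s".format(encode_s, decode_s))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;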

&lt;p&gt;&lt;a href="https://community.ops.io/images/DYvEe5xcO4TfJVk4bPa5hgYtA1BUSmMqDyawWYIIAgU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva24w/bWFqcTJlOGR4NXFo/YnFqY3YucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/DYvEe5xcO4TfJVk4bPa5hgYtA1BUSmMqDyawWYIIAgU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva24w/bWFqcTJlOGR4NXFo/YnFqY3YucG5n" alt="Tables" width="795" height="503"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This comparison is not about declaring which format is better. Both have their own strengths, but if we work backward from the end and ask ourselves which parameters matter most for our use case, and both low latency and a small footprint are among them, then protobuf would be a suitable choice.&lt;br&gt;
If the added complexity is a drawback, I highly recommend checking out &lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management#meet-schemaverse"&gt;Schemaverse&lt;/a&gt; and how it eliminates most of the heavy lifting when JSON is the format in use but the benefits of protobuf are appealing. &lt;/p&gt;




&lt;p&gt;Resources&lt;br&gt;
&lt;a href="https://infinum.com/blog/json-vs-protocol-buffers/"&gt;https://infinum.com/blog/json-vs-protocol-buffers/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://levelup.gitconnected.com/protobuf-a-high-performance-data-interchange-format-64eaf7c82c0d"&gt;https://levelup.gitconnected.com/protobuf-a-high-performance-data-interchange-format-64eaf7c82c0d&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/"&gt;memphis.dev&lt;/a&gt; by &lt;a href="https://twitter.com/memphisveta"&gt;Sveta Gimpelson&lt;/a&gt; Co-founder &amp;amp; VP of Data &amp;amp; Research at Memphis.dev.&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rawdata</category>
      <category>datatraffic</category>
      <category>protobuf</category>
      <category>json</category>
    </item>
    <item>
      <title>Part 1: Integrating Debezium Server and Memphis.dev for Streaming Change Data Capture (CDC) Events</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Thu, 18 May 2023 08:17:12 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/part-1-integrating-debezium-server-and-memphisdev-for-streaming-change-data-capture-cdc-events-1ia1</link>
      <guid>https://community.ops.io/memphis_dev/part-1-integrating-debezium-server-and-memphisdev-for-streaming-change-data-capture-cdc-events-1ia1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is part one of a series of blog posts on building a modern event-driven system using Memphis.dev.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Change_data_capture"&gt;Change data capture (CDC)&lt;/a&gt; is an architectural pattern which turns databases into sources for event-driven architectures. Frequently, CDC is implemented on top of built-in replication support. Changes to data (e.g., caused by INSERT, UPDATE, and DELETE statements) are recorded as atomic units and appended to a replication log for transmission to replicas. CDC software copies the events from the replica logs to streaming infrastructure for processing by downstream components.&lt;/p&gt;

&lt;p&gt;So what do CDC events look like? In this tutorial, we’ll use the example of a table of todo items with the following fields:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/AVINdpHeExA17ADCSMG6nh-ot2KLDZD0h_iEVIue0DQ/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOGU2/bzAxdGRrODNvOHI4/NWUzMzYucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/AVINdpHeExA17ADCSMG6nh-ot2KLDZD0h_iEVIue0DQ/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOGU2/bzAxdGRrODNvOHI4/NWUzMzYucG5n" alt="Table" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A null value for the due date signifies that there is no due date.&lt;/p&gt;

&lt;p&gt;If a user creates a todo item to buy milk from the store, the corresponding CDC event would look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
“before” : null,

“after” : {
    “id” : 25,
    “description” : “buy milk”,
    “creation_timestamp” : “2023-05-01T16:32:15”,
    “due_date” : “2023-05-02”,
    “completed”: False }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user then completes (updates) the todo item, the following CDC event would be generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
“before” : {
    “id” : 25,
    “description” : “buy milk”,
    “creation_timestamp” : “2023-05-01T16:32:15”,
    “due_date” : “2023-05-02”,
    “completed”: False },

“after” : {
    “id” : 25,
    “description” : “buy milk”,
    “creation_timestamp” : “2023-05-01T16:32:15”,
    “due_date” : “2023-05-02”,
    “completed”: True }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user then deletes the item, the CDC event will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"before" : {
     "id" : 25,
     "description" : "buy milk",
     "creation_timestamp" : "2023-05-01T16:32:15",
     "due_date" : "2023-05-02",
     "completed" : false },

"after" : null
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CDC approach is used to support various data analyses that are run in near real-time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Copying data from one database to another&lt;/strong&gt;. Modern systems often incorporate multiple storage solutions chosen because of their optimizations for complementary workloads. For example, online transaction processing (OLTP) databases like &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; are designed to support many concurrent users each performing queries that touch a small amount of data. Online analytical processing (OLAP) databases like &lt;a href="https://clickhouse.com/"&gt;Clickhouse&lt;/a&gt; are optimized to handle a small number of queries touching a large amount of data. The CDC approach doesn’t require schema changes (e.g., adding update timestamps) and is less resource intensive than approaches like running periodic queries to find new or changed records.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performing real-time data integration&lt;/strong&gt;. Some tasks require that data be pulled from multiple data sets and integrated. For example, a user’s clickstream (page view) events may be enriched with details of the products they’re browsing to feed a machine learning model. Performing these joins in the production OLTP databases reduces application responsiveness. CDC allows computationally heavy actions to be run on dedicated subsystems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performing aggregations or window analyses&lt;/strong&gt;. An OLTP database may only log events such as commercial transactions. Business analysts may want to see the current sum of sales in a given quarter, however. The events captured through CDC can be aggregated in real-time to update dashboards and other data applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performing de-aggregations&lt;/strong&gt;. For performance reasons, an OLTP database may only store the current state of data like counters. For example, a database may store the number of likes or views of social media posts. Machine learning models often need individual events, however. CDC generates an event for every increase or decrease in the counters, effectively creating a historical time series for downstream analyses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
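
&lt;p&gt;The de-aggregation case can be sketched in a few lines of Python. The event shape follows the todo-item examples in this post, but the &lt;code&gt;likes&lt;/code&gt; counter and &lt;code&gt;post_id&lt;/code&gt; field are hypothetical, so treat this as an illustration of the idea rather than production code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deaggregate(cdc_event):
    """Turn a CDC update on a counter column into unit delta events."""
    before = cdc_event["before"] or {}
    after = cdc_event["after"] or {}
    delta = after.get("likes", 0) - before.get("likes", 0)
    if delta == 0:
        return []
    step = delta // abs(delta)   # +1 per like, -1 per unlike
    return [{"post_id": after.get("id"), "delta": step}] * abs(delta)

# An update raising the counter from 2 to 5 yields three unit events
events = deaggregate({"before": {"id": 7, "likes": 2},
                      "after": {"id": 7, "likes": 5}})
print(len(events))   # 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;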




&lt;h2&gt;
  
  
  Implementing the CDC Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://debezium.io/"&gt;Debezium&lt;/a&gt; is a popular open-source tool for facilitating CDC. Debezium provides connectors for various open-source and proprietary databases. The connectors listen for data change events and convert the internal representations to common formats such as JSON and Avro. Additionally, some support is provided for filtering and transforming events.&lt;/p&gt;

&lt;p&gt;Debezium was originally designed as connectors that run in the &lt;a href="https://kafka.apache.org/documentation/#connect"&gt;Apache Kafka Connect&lt;/a&gt; framework. Apache Kafka, unfortunately, has a pretty large deployment footprint for production setups. It is &lt;a href="https://docs.confluent.io/platform/current/kafka/deployment.html"&gt;recommended&lt;/a&gt; that a minimal production deployment has at least 3 nodes with 64 GB of RAM and 24 cores with storage configured with RAID 10 and at least 3 additional nodes for a separate Apache Zookeeper cluster. Meaning, a minimal production setup of Apache Kafka requires at least 6 nodes. Further, the JVM and operating system need to be &lt;a href="https://docs.confluent.io/platform/current/kafka/deployment.html#cp-production-parameters"&gt;tuned&lt;/a&gt; significantly to achieve optimal performance.&lt;/p&gt;

&lt;p&gt;Many cloud-native systems are divided into microservices that are designed to scale independently. Rather than relying on one large message broker cluster, it’s common for these systems to deploy multiple, small, independent clusters. &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt; is a next-generation, cloud-native message broker with a low resource footprint, minimal operational overhead, and no required performance tuning.&lt;/p&gt;

&lt;p&gt;Debezium recently announced the general availability of Debezium Server, a framework for using Debezium connectors without Apache Kafka. &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt; runs in a standalone mode. Sink connectors for a wide range of messaging systems are included out of the box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/8XC6DH1JWM16s_jbl1z4jJ452fZN2kthPskw44t0L50/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvODF2/dnBmZmc3Y2ZjNGZx/aXk3dTguanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/8XC6DH1JWM16s_jbl1z4jJ452fZN2kthPskw44t0L50/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvODF2/dnBmZmc3Y2ZjNGZx/aXk3dTguanBn" alt="CDC-PATTERN" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll demonstrate how to implement the CDC pattern for PostgreSQL using Debezium Server and Memphis.dev.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Collaborative Power of Open Source: Interfacing Debezium Server and Memphis.dev
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt; and Debezium Server are integrated using REST. &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;The Memphis.dev REST gateway&lt;/a&gt; provides endpoints for consuming messages, while Debezium Server provides the HTTP client sink for transmitting messages via REST. In our reference solution, Debezium Server makes a POST request to /station/todo-cdc-events/produce/single for each message. The REST interface accepts messages in JSON, text, and binary Protocol Buffer formats.&lt;/p&gt;

&lt;p&gt;Unfortunately, we hit a stumbling block while implementing our CDC solution. The Memphis.dev REST gateway uses JSON Web Tokens (JWT) authentication for security, but Debezium Server’s HTTP client didn’t support it. Thanks to the collaborative power of open source, we were able to work with the Debezium developers to &lt;a href="https://github.com/debezium/debezium-server/pull/20"&gt;add JWT authentication functionality&lt;/a&gt;. The user must specify a username, password, and authentication endpoint URL in the Debezium Server configuration file. The server then tracks its authentication state and makes REST requests to perform an initial authorization and refresh that authorization as needed.&lt;/p&gt;

&lt;p&gt;With the JWT authentication now in place, Debezium Server can forward CDC events to Memphis.dev. Further, all Debezium Server users, whether or not they are using Memphis.dev, can benefit from this functionality.&lt;/p&gt;
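
&lt;p&gt;To make the flow concrete, here is a minimal Python sketch of the request Debezium Server effectively makes. The station name and produce path come from this article; the gateway address and token value are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

GATEWAY = "http://localhost:4444"   # placeholder address for the REST gateway

def build_produce_request(station, jwt, event):
    """Build the URL, headers, and body for a single-message produce call."""
    url = "{}/station/{}/produce/single".format(GATEWAY, station)
    headers = {
        "Authorization": "Bearer " + jwt,   # the JWT obtained at login
        "Content-Type": "application/json",
    }
    return url, headers, json.dumps(event)

url, headers, body = build_produce_request(
    "todo-cdc-events",
    "example.jwt.token",
    {"before": None, "after": {"id": 25, "description": "buy milk"}},
)
print(url)   # http://localhost:4444/station/todo-cdc-events/produce/single
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any HTTP client can then POST the body with those headers; Debezium Server performs this call, and the token refresh, automatically once configured.&lt;/p&gt;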




&lt;h2&gt;
  
  
  Overview of the Solution
&lt;/h2&gt;

&lt;p&gt;Here, we describe a reference solution for delivering change data capture events with &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev&lt;/a&gt;. &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server’s&lt;/a&gt; HTTP client sink is used to send the CDC events from a &lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; database to a Memphis.dev instance using the &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;Memphis REST gateway&lt;/a&gt;. Our solution has six components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Todo Item Generator&lt;/strong&gt;: Inserts a randomly-generated todo item in the PostgreSQL table every 0.5 seconds. Each todo item contains a description, creation timestamp, optional due date, and completion status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Configured with a single database containing a single table (todo_items).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Debezium Server&lt;/strong&gt;: Instance of Debezium Server configured with PostgreSQL source and HTTP Client sink connectors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev REST Gateway&lt;/strong&gt;: Uses the out-of-the-box configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memphis.dev&lt;/strong&gt;: Configured with a single station (todo-cdc-events) and single user (todocdcservice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Printing Consumer&lt;/strong&gt;: A script that uses the Memphis.dev Python SDK to consume messages and print them to the console.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
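
&lt;p&gt;As a sketch of what the todo item generator produces (the actual script in the example repository may differ in its details), each generated row could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
from datetime import datetime, timedelta

DESCRIPTIONS = ["buy milk", "walk the dog", "file taxes", "water plants"]

def random_todo_item():
    """Generate one todo item with the four fields listed above."""
    now = datetime.utcnow()
    due = None
    if random.choice([True, False]):   # the due date is optional
        due = (now + timedelta(days=random.randint(1, 7))).date().isoformat()
    return {
        "description": random.choice(DESCRIPTIONS),
        "creation_timestamp": now.isoformat(timespec="seconds"),
        "due_date": due,    # None signifies no due date
        "completed": False,
    }

print(random_todo_item())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the reference solution, the generator inserts a row like this into the todo_items table every 0.5 seconds.&lt;/p&gt;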

&lt;p&gt;&lt;a href="https://community.ops.io/images/o4z_EUYdxv1Kn31En5WccMpo82sYk4VqF8UwnDdhAPY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvcDdj/Y2U0cmZyd3pzb283/NDAyZ3guanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/o4z_EUYdxv1Kn31En5WccMpo82sYk4VqF8UwnDdhAPY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvcDdj/Y2U0cmZyd3pzb283/NDAyZ3guanBn" alt="Postgress CDC example" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Implementation
&lt;/h2&gt;

&lt;p&gt;Code repository: &lt;a href="https://github.com/memphisdev/memphis-example-solutions"&gt;Memphis Example Solutions&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://docs.docker.com/compose/"&gt;Docker Compose&lt;/a&gt; will be needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Build the Docker images for &lt;a href="https://debezium.io/documentation/reference/2.2/operations/debezium-server.html"&gt;Debezium Server&lt;/a&gt;, the printing consumer, and database setup (table and user creation).&lt;/p&gt;

&lt;p&gt;Currently, our implementation depends on a pre-release version of Debezium Server for the JWT authentication support. A Docker image will be built directly from the main branch of the Debezium and Debezium Server repositories. Note that this step can take quite a while (~20 minutes) to run. When Debezium Server 2.3.0 is released, we will switch to using the upstream Docker image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose build --pull --no-cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Building 0.0s (0/1)
[+] Building 0.2s (2/3)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
[+] Building 19.0s (5/10)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
[+] Building 19.2s (5/10)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 302B                             0.0s
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load metadata for docker.io/library/debian:bullseye-slim                           0.3s
 =&amp;gt; CACHED [1/6] FROM docker.io/library/debian:bullseye-slim@sha256:9404b05bd09b57c76eccc0c5505b3c88b5feccac808d9b193a4fbac87bb                              0.0s
[+] Building 31.4s (5/10)
[+] Building 32.2s (5/10)
[+] Building 34.2s (5/10)
 =&amp;gt; [internal] load .dockerignore                                0.0s
[+] Building 37.6s (11/11) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 302B                             0.0s
[+] Building 37.7s (5/10)
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 300B                             0.0s
[+] Building 37.9s (5/10)
 =&amp;gt; [internal] load .dockerignore                                0.0s
[+] Building 38.0s (5/10)
[+] Building 38.2s (5/10)
[+] Building 18.9s (4/14)
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 613B                             0.0s
[+] Building 20.0s (4/14)
[+] Building 65.8s (11/11) FINISHED
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
[+] Building 1207.0s (15/15) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 613B                             0.0s
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load metadata for docker.io/library/debian:bullseye-slim                           0.2s
 =&amp;gt; CACHED [1/6] FROM docker.io/library/debian:bullseye-slim@sha256:9404b05bd09b57c76eccc0c5505b3c88b5feccac808d9b193a4fbac87bb                              0.0s
 =&amp;gt; [ 2/13] RUN apt update &amp;amp;&amp;amp; apt upgrade -y &amp;amp;&amp;amp; apt install -y openjdk-11-jdk-headless wget git curl &amp;amp;&amp;amp; rm -rf /var/cache/apt/ 49.5s
 =&amp;gt; [ 3/13] RUN git clone https://github.com/debezium/debezium   6.0s
 =&amp;gt; [ 4/13] WORKDIR /debezium                                    0.1s
 =&amp;gt; [ 5/13] RUN ./mvnw clean install -DskipITs -DskipTests     761.4s
 =&amp;gt; [ 6/13] RUN git clone https://github.com/debezium/debezium-server debezium-server-build                                            1.1s
 =&amp;gt; [ 7/13] WORKDIR /debezium-server-build                       0.0s
 =&amp;gt; [ 8/13] RUN ./mvnw package -DskipITs -DskipTests -Passembly372.1s
 =&amp;gt; [ 9/13] RUN tar -xzvf debezium-server-dist/target/debezium-server-dist-*.tar.gz -C /   2.0s
 =&amp;gt; [10/13] WORKDIR /debezium-server                             0.0s
 =&amp;gt; [11/13] RUN mkdir data                                       0.5s
 =&amp;gt; exporting to image                                          14.0s =&amp;gt; =&amp;gt; exporting layers                                          14.0s
 =&amp;gt; =&amp;gt; writing image sha256:51d987a3bf905f35be87ce649099e76c13277d75c4ac26972868fc9af2617d14                                                                0.0s
 =&amp;gt; =&amp;gt; naming to docker.io/library/debezium-server               0.0s
[+] Building 41.8s (11/11) FINISHED
 =&amp;gt; [internal] load build definition from Dockerfile             0.0s
 =&amp;gt; =&amp;gt; transferring dockerfile: 302B                             0.0s
 =&amp;gt; [internal] load .dockerignore                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 2B                                  0.0s
 =&amp;gt; [internal] load metadata for docker.io/library/debian:bullseye-slim                           0.3s
 =&amp;gt; [internal] load build context                                0.0s
 =&amp;gt; =&amp;gt; transferring context: 39B                                 0.0s
 =&amp;gt; CACHED [1/6] FROM docker.io/library/debian:bullseye-slim@sha256:9404b05bd09b57c76eccc0c5505b3c88b5feccac808d9b193a4fbac87bb                              0.0s
 =&amp;gt; [2/6] RUN apt update &amp;amp;&amp;amp; apt upgrade -y &amp;amp;&amp;amp; apt install -y python3 python3-pip &amp;amp;&amp;amp; rm -rf /var/cache/apt/*                          33.5s
 =&amp;gt; [3/6] WORKDIR /app                                           0.0s
 =&amp;gt; [4/6] COPY todo_generator.py /app/                           0.0s
 =&amp;gt; [5/6] RUN pip3 install -U pip wheel                          2.0s
 =&amp;gt; [6/6] RUN pip3 install psycopg2-binary                       1.1s
 =&amp;gt; exporting to image                                           4.9s
 =&amp;gt; =&amp;gt; exporting layers                                          4.9s
 =&amp;gt; =&amp;gt; writing image sha256:6424a08a9dedb77b798610a0b87c1c0a0c5f910039d03d673b3cf47ac54c10de                                                                0.0s
 =&amp;gt; =&amp;gt; naming to docker.io/library/todo-generator                0.0s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Start the &lt;a href="https://github.com/memphisdev/memphis"&gt;Memphis.dev broker&lt;/a&gt; and &lt;a href="https://github.com/memphisdev/memphis-rest-gateway"&gt;REST gateway&lt;/a&gt;. Note that the memphis-rest-gateway service depends on the memphis broker service, so the broker service will be started as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d memphis-rest-gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 4/4
 ⠿ Network postgres-debezium-cdc-example_default                   Created                                                       0.1s
 ⠿ Container postgres-debezium-cdc-example-memphis-metadata-1      Healthy                                                       6.1s
 ⠿ Container postgres-debezium-cdc-example-memphis-1               Health...                                                    16.9s
 ⠿ Container postgres-debezium-cdc-example-memphis-rest-gateway-1  Started                                                      17.3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Follow the instructions for &lt;a href="https://github.com/memphisdev/memphis-example-solutions/blob/master/postgres-debezium-cdc-example/docs/setup_memphis.md"&gt;configuring Memphis.dev&lt;/a&gt; with a new station (todo-cdc-events) and user (todocdcservice) using the web UI.&lt;/p&gt;

&lt;p&gt;Point your browser at  &lt;a href="http://localhost:9000/"&gt;http://localhost:9000/&lt;/a&gt;. Click the “sign in with root” link at the bottom of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/Eo7HQVQP6TvYAMVjlq3_KieTF40JbvwPdqbbdVstwZo/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOWVm/bmY1c3RicTNodzc1/emRlaW0ucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/Eo7HQVQP6TvYAMVjlq3_KieTF40JbvwPdqbbdVstwZo/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOWVm/bmY1c3RicTNodzc1/emRlaW0ucG5n" alt="sign in" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log in with root (username) and memphis (password).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/HiUzvIjnVAt6K8nyk42zK2swvSLD8Ti6k-dyMnLghD8/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvYmlz/dTZqbTc3YTZteW1m/bWUycm0ucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/HiUzvIjnVAt6K8nyk42zK2swvSLD8Ti6k-dyMnLghD8/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvYmlz/dTZqbTc3YTZteW1m/bWUycm0ucG5n" alt="memphis root" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the wizard to create a station named todo-cdc-events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/M9xwmrzjFN6mogn9Nkl35ddj0hnD_zllrSR89YVCfNg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMveW1z/Nm1peXEwYm90cjBv/aXZsaW0ucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/M9xwmrzjFN6mogn9Nkl35ddj0hnD_zllrSR89YVCfNg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMveW1z/Nm1peXEwYm90cjBv/aXZsaW0ucG5n" alt="memphis ui" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a user named todocdcservice, using the same value for the password.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/CRSvYfjUNUoLE5aTSfpPJneuHySq9waHfqwQy4ECtFg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZnBl/Zzh2NWN4eDBpaGR2/NnRuMXMucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/CRSvYfjUNUoLE5aTSfpPJneuHySq9waHfqwQy4ECtFg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZnBl/Zzh2NWN4eDBpaGR2/NnRuMXMucG5n" alt="userapp" width="800" height="867"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “next” until the wizard is finished:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/Q-0v0hEvNKlNGRXkHLFa1zy64JejZGa1GdlIAO4fHZw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZG0w/bXFleTFzcjBsMWxl/emR6NDEucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/Q-0v0hEvNKlNGRXkHLFa1zy64JejZGa1GdlIAO4fHZw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZG0w/bXFleTFzcjBsMWxl/emR6NDEucG5n" alt="Image description" width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click “Go to station overview” to go to the station overview page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/SARHAiDZv0QE8B4ieuph7FZme5a19zjUe4vli8_vGHY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOXJu/YjltY2FpMXg3czQ5/MWo1MTQucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/SARHAiDZv0QE8B4ieuph7FZme5a19zjUe4vli8_vGHY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvOXJu/YjltY2FpMXg3czQ5/MWo1MTQucG5n" alt="Image description" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Start the printing consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 3/3
 ⠿ Container postgres-debezium-cdc-example-memphis-metadata-1  H...                                                              0.6s
 ⠿ Container postgres-debezium-cdc-example-memphis-1           Healthy                                                           1.1s
 ⠿ Container printing-consumer                                 Started           
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the todo item generator, PostgreSQL database, and Debezium Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker compose up -d todo-generator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[+] Running 7/7
 ⠿ Container postgres                                              Healthy                                                       7.9s
 ⠿ Container postgres-debezium-cdc-example-memphis-metadata-1      Healthy                                                       0.7s
 ⠿ Container postgres-debezium-cdc-example-memphis-1               Health...                                                     1.2s
 ⠿ Container postgres-debezium-cdc-example-memphis-rest-gateway-1  Running                                                       0.0s
 ⠿ Container database-setup                                        Exited                                                        6.8s
 ⠿ Container debezium-server                                       Healthy                                                      12.7s
 ⠿ Container todo-generator                                        Started      
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the todo item generator depends on the other services and will start them automatically. The database-setup container will run once to create the database, tables, and role in PostgreSQL.&lt;/p&gt;
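&lt;p&gt;The dependency chain above is typically expressed through Compose &lt;code&gt;depends_on&lt;/code&gt; health conditions. A hypothetical fragment illustrating the idea (service names follow the output above; the actual compose file in the example repository may differ):&lt;/p&gt;

```yaml
# Hypothetical sketch -- not the literal compose file from the example repo
todo-generator:
  depends_on:
    postgres:
      condition: service_healthy   # wait until the database healthcheck passes
    debezium-server:
      condition: service_healthy
database-setup:
  restart: "no"   # one-shot job: creates the database, tables, and role, then exits
```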

&lt;p&gt;Lastly, confirm the system is working. Check the todo-cdc-events station overview screen in the Memphis.dev web UI to confirm that the producer and consumer are connected and messages are being delivered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/RezxhpUa7j8yULbBOFPMhD8uUoWuoOAcM3nZhFYh09g/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbTd3/eWVjajg3bnB2NWEx/emlnNDMucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/RezxhpUa7j8yULbBOFPMhD8uUoWuoOAcM3nZhFYh09g/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbTd3/eWVjajg3bnB2NWEx/emlnNDMucG5n" alt="Image description" width="800" height="456"&gt;&lt;/a&gt;&lt;br&gt;
Then print the logs for the printing-consumer container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker logs --tail 2 printing-consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message:  bytearray(b'{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"int64","optional":true,"field":"txId"},{"type":"int64","optional":true,"field":"lsn"},{"type":"int64","optional":true,"field":"xmin"}],"optional":false,"name":"io.debezium.connector.postgresql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"fiel
d":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"tutorial.public.todo_items.Envelope","version":1},"payload":{"before":null,"after":{"item_id":205,"description":"ERJGCHXXOBBGSMOUQSMB","creation_date":1682991115063809,"due_date":null,"completed":false},"source":{"version":"2.3.0-SNAPSHOT","connector":"postgresql","name":"tutorial","ts_ms":1682991115065,"snapshot":"false","db":"todo_application","sequence":"[\\"26715784\\",\\"26715784\\"]","schema":"public","table":"todo_items","txId":945,"lsn":26715784,"xmin":null},"op":"c","ts_ms":1682991115377,"transaction":null}}')
message:  bytearray(b'{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"item_id"},{"type":"string","optional":false,"field":"description"},{"type":"int64","optional":false,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"creation_date"},{"type":"int64","optional":true,"name":"io.debezium.time.MicroTimestamp","version":1,"field":"due_date"},{"type":"boolean","optional":false,"field":"completed"}],"optional":true,"name":"tutorial.public.todo_items.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"int64","optional":true,"field":"txId"},{"type":"int64","optional":true,"field":"lsn"},{"type":"int64","optional":true,"field":"xmin"}],"optional":false,"name":"io.debezium.connector.postgresql.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"str
ing","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"name":"event.block","version":1,"field":"transaction"}],"optional":false,"name":"tutorial.public.todo_items.Envelope","version":1},"payload":{"before":null,"after":{"item_id":206,"description":"KXWQYXRWCGSKTBJOJFSX","creation_date":1682991115566896,"due_date":1683250315566896,"completed":false},"source":{"version":"2.3.0-SNAPSHOT","connector":"postgresql","name":"tutorial","ts_ms":1682991115568,"snapshot":"false","db":"todo_application","sequence":"[\\"26715992\\",\\"26715992\\"]","schema":"public","table":"todo_items","txId":946,"lsn":26715992,"xmin":null},"op":"c","ts_ms":1682991115885,"transaction":null}}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
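&lt;p&gt;To do something with these events beyond printing them, a consumer can decode the JSON envelope and pull out the payload. A minimal sketch, assuming the field layout shown in the logs above (the sample event here is abbreviated and hypothetical):&lt;/p&gt;

```python
import json

def extract_change(raw: bytes) -> dict:
    """Decode a Debezium change event and return the operation and new row state."""
    event = json.loads(raw)
    payload = event["payload"]
    return {
        "op": payload["op"],        # "c" = create, "u" = update, "d" = delete
        "table": payload["source"]["table"],
        "after": payload["after"],  # row state after the change (None for deletes)
    }

# Abbreviated sample in the same shape as the consumer logs above
sample = json.dumps({
    "payload": {
        "before": None,
        "after": {"item_id": 205, "description": "ERJGCHXXOBBGSMOUQSMB", "completed": False},
        "source": {"table": "todo_items"},
        "op": "c",
    },
}).encode()

change = extract_change(sample)
print(change["op"], change["after"]["item_id"])  # c 205
```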






&lt;p&gt;Congratulations! You now have a working example of how to capture and transfer data change events from a PostgreSQL database into Memphis.dev using Debezium Server.&lt;/p&gt;




&lt;p&gt;Check out &lt;a href="https://memphis.dev/blog/part-2-change-data-capture-cdc-for-mongodb-with-debezium-and-memphis-dev/"&gt;part 2: Change Data Capture (CDC) for MongoDB with Debezium and Memphis.dev&lt;/a&gt;&lt;br&gt;
&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev by RJ Nowling, developer advocate at &lt;a href="https://memphis.dev/blog/part-1-integrating-debezium-server-and-memphis-dev-for-streaming-change-data-capture-cdc-events/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>memphisdev</category>
      <category>streamingdata</category>
    </item>
    <item>
      <title>Memphis is now GA!</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Wed, 05 Apr 2023 05:48:32 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/memphis-is-now-ga-m6i</link>
      <guid>https://community.ops.io/memphis_dev/memphis-is-now-ga-m6i</guid>
      <description>&lt;p&gt;Memphis is now GA,&lt;br&gt;
and we do not take this title for granted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's start from the beginning.
&lt;/h2&gt;

&lt;p&gt;Struggling with the engineering side of legacy brokers and queues planted the idea that the space needed to be disrupted.&lt;br&gt;
We dug deeper and realized a simple fact that we usually forget – messaging queues and brokers are a means to an end, not the goal itself, and that understanding opens up a whole new variety of solutions.&lt;/p&gt;

&lt;p&gt;The chosen solution was the most challenging one, but we believe, also the right one – a) It has to be open-sourced. b) It can’t be just an intelligent message broker on steroids. c) It has to offer what we call the “Day 2” operations on top, to help build queue-based applications in minutes – from the more common use cases, such as async communication between microservices and task scheduling, to event-driven applications, event sourcing, data ingestion, system integration, log collecting, and forming a data lake.&lt;br&gt;
With that understanding in mind, we formed the vision of Memphis – an intelligent and frictionless message broker that enables an ultra-fast development of queue-based applications for developers and data engineers.&lt;/p&gt;




&lt;h2&gt;
  
  
  From vision to GA.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Memphis beta version released on May 15th, 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We focused on the foundations of the ecosystem: integrating with NATS internals, designing Memphis to run natively on Kubernetes and cloud-native environments, and making everything work out of the box – from monitoring, dead-letter station, and schema validation to real-time observability, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With each release, the bug cycles became shorter and smaller, and we, as a team and a product, grew smarter by carefully listening to and understanding our users. By doing that, Memphis reached a solid and stable GA and, no less importantly, became suitable for most developers, not just those who share our original challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On April 2nd, we shipped the GA release of the 1st part of Memphis.&lt;br&gt;
Memphis GA stands for a solid, stable, and secure foundation for what is to come, with zero known bugs and ready for production.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Some insights from the last eight months.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The average time from installation to first data ingestion is 5 minutes.&lt;/li&gt;
&lt;li&gt;We grew from 0 to over 5000 deployments.&lt;/li&gt;
&lt;li&gt;50 new contributors.&lt;/li&gt;
&lt;li&gt;Users reported production usage before the GA release.&lt;/li&gt;
&lt;li&gt;Use cases range from async communication between microservices, event-driven applications, event sourcing, data ingestion, system integration, log collecting, and security events, to forming a data lake.&lt;/li&gt;
&lt;li&gt;Schemaverse has been a game changer to many of our users.&lt;/li&gt;
&lt;li&gt;The most used SDK is Go.&lt;/li&gt;
&lt;li&gt;Cost and simplicity have been major factors in replacing existing tech with Memphis.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The future to come.
&lt;/h2&gt;

&lt;p&gt;I mentioned 1st part, so there is a 2nd part.&lt;br&gt;
Memphis’ 1st part is the storage layer, the message broker with all its benefits as we know it today, and will continue to evolve dramatically over the coming releases. We will also push hard on GitOps, automation enablement, and reconstructing some of the APIs so they can be modular and open for the community to self-implement new ones. Last but not least – multi-tenancy, partitions, read-replicas, and more.&lt;/p&gt;

&lt;p&gt;Memphis’ 2nd part is all about helping developers and data engineers build valuable use cases and queue-based applications on top of Memphis. More on that in the coming weeks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/SIOfJae_YYDjmlgNCpYuYU_No0TjQDaKzG_ZYRgY-_Q/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM253/dGxzbXY0MWZ2ZmIz/aXY0dG8uanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/SIOfJae_YYDjmlgNCpYuYU_No0TjQDaKzG_ZYRgY-_Q/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM253/dGxzbXY0MWZ2ZmIz/aXY0dG8uanBn" alt="v.1.0.0 is out" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Memphis &lt;a href="https://docs.memphis.dev/memphis/memphis-cloud/signup"&gt;cloud&lt;/a&gt; is right around the corner, but if you prefer to self-host Memphis now - head &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at Memphis.dev By Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>messagequeu</category>
      <category>dataprocessing</category>
      <category>datastreaming</category>
      <category>memphisdev</category>
    </item>
    <item>
      <title>Batch Processing vs Stream Processing</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Thu, 30 Mar 2023 11:51:21 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/batch-processing-vs-stream-processing-8h8</link>
      <guid>https://community.ops.io/memphis_dev/batch-processing-vs-stream-processing-8h8</guid>
      <description>&lt;p&gt;In the digital age, data is the new currency and is being used everywhere.&lt;br&gt;
From social media to IoT devices, businesses are generating more data than ever before.&lt;br&gt;
With this data comes the challenge of processing it in a timely and efficient way.&lt;br&gt;
Companies all over the world are investing in technologies that can help them better process, analyze, and use the data they are collecting to better serve their customers and stay ahead of their competitors.&lt;br&gt;
One of the most important decisions organizations make when it comes to data processing is whether to use stream or batch processing. Stream processing is quickly becoming the go-to option for many companies because of its ability to provide real-time insights and immediate actionable results. With the right stream processing platform, companies can easily unlock the value of their data and use it to gain a competitive edge. This article will explore why stream processing is taking over, including its advantages over batch processing, such as scalability, cost-effectiveness, and flexibility.&lt;/p&gt;

&lt;p&gt;Let’s recap some of the basics first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Processing
&lt;/h2&gt;

&lt;p&gt;Data processing is the process of transforming raw data into meaningful and useful information. It involves a wide range of activities, including data collection, data cleaning, data integration, data analysis, and data visualization.  It is an essential part of the analysis and decision-making process in many industries, including finance, healthcare, education, engineering, and business.&lt;/p&gt;

&lt;p&gt;Data processing can be divided into two main categories: Manual Data Processing and Automated Data Processing. &lt;/p&gt;

&lt;p&gt;Manual data processing involves the use of manual input, paper forms, manual calculations, and the entry of data into software programs. Manual data processing is often slow and error-prone, though it can still be useful for small volumes of data or tasks that require human judgment. Automated data processing, however, is faster and more efficient than manual data processing. It uses algorithms and software to automate the processing of data, including activities such as sorting, filtering, and summarizing.&lt;/p&gt;
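&lt;p&gt;As a toy illustration of automated processing, the same filter, sort, and summarize activities take only a few lines of code (the records and field names here are invented):&lt;/p&gt;

```python
# Automated processing of a small set of records: filter, sort, summarize.
orders = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 75.5},
    {"id": 3, "region": "EU", "amount": 40.0},
]

eu_orders = [o for o in orders if o["region"] == "EU"]   # filter
eu_orders.sort(key=lambda o: o["amount"], reverse=True)  # sort
total = sum(o["amount"] for o in eu_orders)              # summarize

print(total)  # 160.0
```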

&lt;p&gt;Data processing can also be classified into several types. These include batch processing, real-time processing, stream processing, multi-processing, and time-sharing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;: This is a type of data processing that involves the execution of a series of pre-defined instructions or programs on a batch of data. It is typically used for tasks that require large amounts of data to be processed, such as data mining or data warehousing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time processing&lt;/strong&gt;: This is a type of data processing that involves the continuous analysis of data as it arrives. It is typically used for applications that require immediate analysis and response to incoming data, such as fraud detection and monitoring consumer/user activity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stream processing&lt;/strong&gt;: This is a type of data processing that involves the continuous, real-time analysis of data streams. It is similar to real-time processing, but typically involves more complex operations and is capable of handling large volumes of data with low latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Processing:&lt;/strong&gt; Multi-processing is a type of data processing that involves multiple processors working simultaneously on different tasks. Multi-processing is often used to speed up the processing of large amounts of data. By using multiple processors, the same task can be completed faster than if it were done on a single processor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time-sharing:&lt;/strong&gt; Time-sharing is a type of data processing that allows multiple users to access the same computer or system at the same time. Time-sharing systems provide better efficiency and performance than batch processing and are often used in applications such as online banking, e-commerce, and web hosting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Overall, data processing is an essential part of modern business and society, and is critical for turning raw data into useful information that can be used to make informed decisions and drive business growth.&lt;/p&gt;

&lt;p&gt;Let’s discuss stream and batch data processing in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Stream Processing?
&lt;/h2&gt;

&lt;p&gt;Stream processing is a type of data processing that involves continuous, real-time analysis of data streams. It is a way of handling large volumes of data that are generated by various sources, such as sensors, financial transactions, or social media feeds, in real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/rfhEu7cvm2Xw3XRUmZyetxfIAMG8vQYfxzNR_XaT3Bc/w:880/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbWtr/czBmOWd6ZDNqZTY1/NmY0NHAuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/rfhEu7cvm2Xw3XRUmZyetxfIAMG8vQYfxzNR_XaT3Bc/w:880/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbWtr/czBmOWd6ZDNqZTY1/NmY0NHAuanBn" alt="What is stream processing" width="880" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Stream Processing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Nature&lt;/strong&gt;: One of the primary advantages of stream processing is its real-time nature. Because data is processed as it is received, stream processing allows for faster analysis and decision-making. This can be especially useful in applications where time is of the essence, such as in financial trading or emergency response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Another advantage of stream processing is its scalability. Because stream processing systems are designed to handle large volumes of data in real-time, they can easily scale to handle increases in data volume without compromising on performance. This makes them well-suited to applications that deal with large amounts of data, such as internet of things (IoT) applications or social media analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced Cost&lt;/strong&gt;: Stream processing also helps organizations save money by reducing costs associated with storing large amounts of data. Stream processing systems can store only the data that is required for processing, eliminating the need to store and manage large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;: Stream processing can also be more secure than traditional batch processing systems. Many stream processing systems use encryption techniques to ensure that data is kept secure and confidential. This helps organizations to ensure that their data remains safe and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Stream Processing
&lt;/h2&gt;

&lt;p&gt;Overall, stream processing is a powerful tool for handling large volumes of data in real-time, but it also comes with its own set of challenges. &lt;/p&gt;

&lt;p&gt;One of the main challenges of stream processing is ensuring the &lt;strong&gt;accuracy and consistency&lt;/strong&gt; of the data. Because stream processing involves continuous analysis of data in real-time, any errors or inconsistencies in the data can quickly propagate throughout the system, leading to incorrect results. This can be particularly problematic in complex systems with many different data sources, and can require careful design and management to ensure the quality of the data.&lt;/p&gt;

&lt;p&gt;Another challenge of stream processing is dealing with &lt;strong&gt;late or out-of-order data&lt;/strong&gt;. In stream processing, data is often generated by multiple sources, and it can arrive at different times or in a different order than expected. This can make it difficult to accurately process the data, and can require the use of specialized techniques to handle such situations. For example, some stream processing systems use techniques such as windowing or buffering to delay the processing of data until all necessary information is available, or to reorder data if it arrives out of sequence.&lt;/p&gt;
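&lt;p&gt;One way to picture the buffering technique: hold events in a small priority queue and release them in timestamp order only once a lateness window has passed. A simplified sketch (the window size and event shapes are invented for illustration):&lt;/p&gt;

```python
import heapq

class ReorderBuffer:
    """Buffer events and release them in timestamp order once a lateness window has passed."""
    def __init__(self, max_lateness: int):
        self.max_lateness = max_lateness
        self.heap = []  # min-heap of (timestamp, event)

    def push(self, timestamp: int, event: str) -> list:
        heapq.heappush(self.heap, (timestamp, event))
        # Events older than the watermark are assumed complete and can be released.
        watermark = timestamp - self.max_lateness
        released = []
        while self.heap and self.heap[0][0] <= watermark:
            released.append(heapq.heappop(self.heap))
        return released

buf = ReorderBuffer(max_lateness=5)
buf.push(10, "a")
buf.push(8, "b")          # late, out-of-order arrival: still buffered
out = buf.push(16, "c")   # watermark = 11 releases everything at t <= 11, in order
print(out)  # [(8, 'b'), (10, 'a')]
```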

&lt;p&gt;A third challenge associated with stream processing is maintaining the &lt;strong&gt;performance&lt;/strong&gt; of the system. Because stream processing involves continuous analysis of data, it can put a heavy load on the underlying infrastructure, which can impact the overall performance of the system. This can be particularly problematic in systems with high volumes of data, or with complex data processing pipelines. To address this challenge, stream processing systems often use techniques such as parallelism, load balancing, and data partitioning to distribute the workload across multiple machines and improve the overall performance of the system.&lt;/p&gt;
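&lt;p&gt;Data partitioning is often done by hashing a key so that all events for the same key land on the same worker, which spreads load across machines while preserving per-key ordering. A minimal sketch (the worker count and keys are invented):&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

def partition(key: str, num_workers: int) -> int:
    """Stable hash partitioning: the same key always maps to the same worker."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_workers

events = ["user-1", "user-2", "user-1", "user-3", "user-2"]
assignments = defaultdict(list)
for key in events:
    assignments[partition(key, num_workers=4)].append(key)
# All occurrences of a given key are routed to one partition.
```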

&lt;p&gt;The challenges stated above can be addressed through careful design and management of the system, as well as the use of specialized techniques to ensure the accuracy and performance of the data processing pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;Stream processing can be used in many different use cases and can be applied to a variety of industries, including finance, retail, healthcare, telecommunications, and IoT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finance&lt;/strong&gt;: Stream processing can be used to analyze market data in real time and detect fraud. By analyzing customer transactions and patterns, banks can quickly identify suspicious activity and alert authorities. This helps reduce the potential losses caused by fraudulent activities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retail&lt;/strong&gt;: Stream processing can be used to provide customers with personalized offers and recommendations. By analyzing customer data in real time, retailers can create targeted campaigns that are tailored to each individual customer's preferences. This allows them to offer more relevant products and services, which can lead to increased customer satisfaction and loyalty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health Care&lt;/strong&gt;: Stream processing can be used to monitor patient health in real time. By collecting data from various medical devices and sensors, healthcare providers can quickly identify any changes in a patient's health status. This can help them detect and treat conditions before they become serious and costly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telecommunication&lt;/strong&gt;: Stream processing can be used to monitor network performance in real time. By analyzing data from various telecommunication networks, service providers can quickly identify any issues or outages and take corrective action. This helps them maintain a high level of service quality and provide reliable connections to their customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internet of Things (IoT)&lt;/strong&gt;: Stream processing can also be used to collect and analyze data from connected devices. This can help organizations gain valuable insights into how their devices are performing and make informed decisions about how to optimize their operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Batch Data Processing?
&lt;/h2&gt;

&lt;p&gt;Batch data processing is a method of executing a series of tasks in a predetermined sequence. It involves dividing a large amount of data into smaller, more manageable units called batches, which are processed independently and in parallel. In batch processing, a group of transactions or data is collected over a period of time and then processed all at once, typically overnight or during a maintenance window. Batch processing is often used in large-scale computing systems and data processing applications, such as payroll, invoicing, and inventory management.&lt;/p&gt;
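&lt;p&gt;The core mechanic, splitting a dataset into fixed-size batches that can be processed independently, can be sketched in a few lines (the batch size and data are invented):&lt;/p&gt;

```python
from typing import Iterator, List

def batched(items: List[int], batch_size: int) -> Iterator[List[int]]:
    """Split a dataset into fixed-size batches; the last batch may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

records = list(range(10))
batches = list(batched(records, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```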

&lt;p&gt;&lt;a href="https://community.ops.io/images/nEmYrQfADfVn8rrUpVvIjpfQHf0E1WanYSAFLLdvdA8/w:880/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbDJj/M3F4Y3ZvZnhqOWF5/a3dhaDkuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/nEmYrQfADfVn8rrUpVvIjpfQHf0E1WanYSAFLLdvdA8/w:880/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbDJj/M3F4Y3ZvZnhqOWF5/a3dhaDkuanBn" alt="What is Batch Data Processing?" width="880" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Batch Processing
&lt;/h2&gt;

&lt;p&gt;There are several advantages to using batch processing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improved efficiency and speed&lt;/strong&gt;: Batch processing allows for the concurrent execution of multiple jobs, which can significantly improve the speed and efficiency of processing large amounts of data. By processing multiple transactions or data sets at once, batch processing can reduce the amount of time it takes to complete a task, allowing organizations to complete more work in less time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced costs&lt;/strong&gt;: Batch data processing can also help to reduce costs by reducing the need for manual intervention and labor. By automating repetitive tasks, organizations can reduce the amount of time and resources that are required to complete a task, leading to cost savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increased accuracy&lt;/strong&gt;: Batch processing can help to increase the accuracy of data processing by ensuring that all transactions are processed consistently, according to predefined rules and procedures. This can help to reduce the potential for errors and inconsistencies, leading to more accurate and reliable results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhanced security&lt;/strong&gt;: Batch processing can also help to improve the security of data processing by limiting the access to sensitive data to authorized personnel only. By controlling access to data and processing it in a secure environment, organizations can help to prevent unauthorized access and protect against potential security threats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Improved scalability&lt;/strong&gt;: Batch data processing is highly scalable, meaning that it can be easily adapted to handle increased volumes of data without a significant impact on performance. This allows organizations to easily and efficiently process large amounts of data as their needs evolve, without the need for additional resources or infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Batch Processing
&lt;/h2&gt;

&lt;p&gt;There are also some challenges associated with batch processing. &lt;/p&gt;

&lt;p&gt;One of the main challenges is the need for &lt;strong&gt;careful planning and coordination&lt;/strong&gt;. Since batch processing is executed in a predetermined sequence, it is important to carefully plan and coordinate the execution of tasks to ensure that they are completed in the correct order. &lt;/p&gt;

&lt;p&gt;Another challenge of batch processing is that it can be &lt;strong&gt;time-consuming&lt;/strong&gt;. Since data is collected and processed in large quantities, it can take a significant amount of time to complete a batch. This can be especially problematic for businesses that need to process data in real-time, as batch processing may not be fast enough to keep up with the demands of the business.&lt;/p&gt;

&lt;p&gt;Batch processing can also be more &lt;strong&gt;complex&lt;/strong&gt; to implement and maintain, as it requires the development and management of batch schedules and processes. This can require additional resources and expertise, which can be a challenge for some organizations.&lt;/p&gt;

&lt;p&gt;Another challenge of batch processing is the &lt;strong&gt;limited visibility&lt;/strong&gt; it provides into the status of individual transactions or data items. With batch processing, it is often difficult to see the status of a particular transaction or data item within the batch, which can make it challenging to identify and address any issues that may arise.&lt;/p&gt;

&lt;p&gt;Batch processing can also present challenges when it comes to maintaining &lt;strong&gt;data integrity&lt;/strong&gt;. If a batch fails, it can be difficult to determine which data items were processed and which were not, which can lead to data loss or errors.&lt;/p&gt;

&lt;p&gt;In addition, batch processing can be &lt;strong&gt;error-prone&lt;/strong&gt;. Since data is processed in large quantities, it can be difficult to catch and correct errors within a batch. This can lead to inaccurate or incomplete results, which can be damaging to a business.&lt;/p&gt;
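&lt;p&gt;To make the data-integrity and error-handling challenges above concrete, here is a minimal, framework-agnostic Python sketch (the function names are made up for illustration): checkpointing after each batch records which batches completed, so a mid-run failure does not leave the whole run ambiguous.&lt;/p&gt;

```python
# A minimal sketch (not any particular framework's API) of checkpointed
# batch processing: recording progress after each batch makes it possible
# to tell which items were processed if a later batch fails.

def process_in_batches(items, batch_size, handle_batch):
    """Process items batch by batch, returning the start index of each completed batch."""
    completed = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        handle_batch(batch)          # may raise; earlier checkpoints stay recorded
        completed.append(start)      # checkpoint: this batch fully succeeded
    return completed

processed = []
checkpoints = process_in_batches(list(range(10)), 4, processed.extend)
# three batches: [0..3], [4..7], [8..9]
```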




&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data analytics&lt;/strong&gt;: Batch processing is used in data analytics to process large amounts of data and generate insights or reports. For example, a company might use batch processing to analyze customer data and generate reports on customer behavior or preferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL (extract, transform, load) processes&lt;/strong&gt;: Batch processing is often used in ETL (extract, transform, load) processes to extract data from various sources, transform it into a format suitable for analysis or reporting, and load it into a data warehouse or other system.&lt;/p&gt;
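&lt;p&gt;As an illustration of the ETL pattern, here is a toy extract-transform-load pass in plain Python; no particular source system or warehouse is assumed.&lt;/p&gt;

```python
# A toy ETL pass: extract rows from a source, transform them into a
# reporting-friendly shape, and load them into a destination store.

def extract(source):
    # parse raw comma-separated lines into fields
    return [line.split(",") for line in source]

def transform(rows):
    # normalize names and parse the amount column
    return [{"customer": name.strip().title(), "amount": float(amt)}
            for name, amt in rows]

def load(records, warehouse):
    warehouse.extend(records)

warehouse = []
load(transform(extract(["alice ,10.5", "BOB,3"])), warehouse)
```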

&lt;p&gt;&lt;strong&gt;Inventory Management&lt;/strong&gt;: Batch processing is also used in inventory management systems to process orders, track inventory levels, and generate reports. By processing data in a batch, it is possible to more efficiently manage and track inventory levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Financial transactions&lt;/strong&gt;: Batch processing is commonly used in the financial industry to process large numbers of transactions, such as credit card transactions or stock trades. For example, a bank might use batch processing to process transactions from multiple branches or ATMs, and then update customer accounts accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online services&lt;/strong&gt;: Batch processing is also used in the development of online services, such as web applications or mobile apps. For example, a social media platform might use batch processing to process large amounts of data in order to generate recommendations for users or to generate reports on user behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  Batch Processing vs Stream Processing: An Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Batch Processing vs Stream Processing: Hardware&lt;/strong&gt;&lt;br&gt;
When it comes to hardware, there are some key differences between batch processing and stream processing. Batch data processing typically requires more powerful hardware, as it needs to be able to handle large amounts of data all at once. This can include powerful servers, high-capacity storage systems, and other specialized hardware.&lt;/p&gt;

&lt;p&gt;On the other hand, stream processing typically requires less powerful hardware. Since data is processed in real-time, it does not need to be stored for later processing. This means that stream processing systems can be more lightweight and can use less powerful hardware.&lt;/p&gt;

&lt;p&gt;Overall, the type of hardware needed for batch processing and stream processing depends on the specific requirements of the system and the amount of data being processed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Processing vs Stream Processing: Data Set&lt;/strong&gt;&lt;br&gt;
One of the main differences between batch processing and stream processing is the type of data they are designed to handle. Batch processing is typically used for data sets that are large and static, such as historical records or logs. In contrast, stream processing is typically used for data sets that are large but constantly changing, such as real-time sensor data.&lt;/p&gt;

&lt;p&gt;Another important difference between batch processing and stream processing is the way they handle data. Batch processing systems typically operate on data that is stored in a database or file system. On the other hand, stream processing systems operate on data that is generated in real-time or near-real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Processing vs Stream Processing: Analysis&lt;/strong&gt;&lt;br&gt;
One more area where these two data processing methods differ is the type of analysis they are designed to perform. Batch processing systems are designed to perform complex, data-intensive analysis, such as machine learning and predictive modeling.&lt;/p&gt;

&lt;p&gt;Stream processing systems, by contrast, are suited to simple, low-latency analyses such as filtering and aggregation, because they process data in small chunks, which limits their ability to perform complex analysis.&lt;/p&gt;
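&lt;p&gt;The contrast can be sketched in a few lines of Python: a batch job computes over the complete stored data set at once, while a stream processor keeps a small running state that it updates per event. This is only an illustration; real systems add windowing, state stores, and fault tolerance.&lt;/p&gt;

```python
# Contrast sketch: a batch job aggregates a complete, stored data set in
# one pass, while a stream processor updates a running aggregate as each
# event arrives.

def batch_average(stored_values):
    return sum(stored_values) / len(stored_values)    # sees all data at once

class StreamingAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):                          # one event at a time
        self.count += 1
        self.total += value
        return self.total / self.count                # low-latency running result

events = [2, 4, 6]
stream = StreamingAverage()
running = [stream.update(v) for v in events]
```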

&lt;p&gt;&lt;strong&gt;Batch Processing vs Stream Processing: Platforms&lt;/strong&gt;&lt;br&gt;
There are several platforms available for both batch processing and stream processing, each with its own unique features and capabilities. &lt;/p&gt;

&lt;p&gt;Some of the most popular platforms for batch processing include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Hadoop&lt;/strong&gt; and &lt;strong&gt;Apache Spark&lt;/strong&gt;, which are open-source distributed computing platforms that are widely used for big data processing and analysis.&lt;/p&gt;

&lt;p&gt;For stream processing, some popular platforms include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Flink&lt;/strong&gt; and &lt;strong&gt;Apache Storm&lt;/strong&gt;, which are also open-source distributed computing platforms. These platforms are often used for applications such as monitoring systems and real-time analytics.&lt;/p&gt;

&lt;p&gt;In addition to these open-source platforms, there are also several commercial platforms available for both batch data processing and stream processing. &lt;/p&gt;

&lt;p&gt;Some examples of commercial batch processing platforms include Cloudera and MapR, which are distributed computing platforms that are designed for big data processing and analysis.&lt;/p&gt;

&lt;p&gt;Let’s shed some light on these commercial platforms!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudera&lt;/strong&gt;: Cloudera is a leading provider of enterprise data cloud solutions, including software and services for data engineering, data warehousing, machine learning and analytics. Cloudera provides an enterprise data platform to customers of all sizes, enabling them to store, process and analyze their data quickly, reliably and securely. Cloudera also offers an array of professional services, such as consulting and training, to help customers get the most out of their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MapR&lt;/strong&gt;: MapR is a distributed data platform for big data applications that provides fast and reliable access to data. It combines an optimized version of the Apache Hadoop open-source software with enterprise-grade features such as high availability, disaster recovery, and global replication. MapR also provides NoSQL databases, streaming analytics, and machine learning capabilities.&lt;/p&gt;

&lt;p&gt;For stream processing, some popular commercial platforms include Confluent, Memphis, and Databricks, which are also distributed computing platforms. These platforms are often used for applications such as fraud detection and real-time recommendation engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent&lt;/strong&gt;: Confluent is an enterprise streaming platform built on Apache Kafka. It provides a range of services to support the development, deployment, and management of streaming data pipelines. It includes features such as real-time data integration, stream processing, and analytics. It also enables organizations to build mission-critical streaming applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memphis&lt;/strong&gt;: &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt; is an open-source, real-time data processing platform that provides end-to-end support for in-app streaming use cases using the Memphis distributed message broker. The platform requires zero ops, enables rapid development and extreme cost reduction, eliminates coding barriers, and saves a great amount of dev time for data-oriented developers and data engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt;: Databricks is a cloud-based platform for data engineering, machine learning, and analytics. It provides an integrated environment for working with big data that simplifies the process of managing and analyzing large datasets. It allows users to easily create data pipelines and complex analytics applications, and supports popular open source libraries such as Apache Spark, MLlib, and TensorFlow.&lt;/p&gt;

&lt;p&gt;Overall, the choice of platform for batch processing or stream processing depends on the specific requirements of the application. Open-source platforms are often a good choice for applications that require flexibility and customization, while commercial platforms may be more suitable for applications that require support and scalability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Is Batch Dying and Streaming Taking Over?
&lt;/h2&gt;

&lt;p&gt;There are several reasons why streaming has become more popular and why batch processing may be declining in popularity.&lt;/p&gt;

&lt;p&gt;One reason is the increasing demand for real-time processing. In today's fast-paced world, many organizations require the ability to process data in real-time in order to respond to changing conditions and make timely decisions. &lt;/p&gt;

&lt;p&gt;Another reason is the increasing availability of streaming technologies and tools. In the past, streaming was more difficult and expensive to implement, but today there are a wide range of tools and technologies available that make it easier and more cost-effective to implement streaming solutions. Also, with streaming it is possible to track the processing of data in real-time, which can be beneficial for debugging and monitoring purposes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This blog post walks through the basics of stream and batch processing, lists some of the advantages and challenges associated with these data processing methods, and then also compares them in terms of performance, data sets, analysis, hardware, and some other features.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt; By &lt;a href="https://www.linkedin.com/in/shoham-roditi-elimelech-0b933314a/"&gt;Shoham Roditi Elimelech&lt;/a&gt;, software engineer at @Memphis.dev &lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataprocessing</category>
      <category>batchprocessing</category>
      <category>streamprocessing</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Stateful stream processing with Memphis and Apache Spark</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Mon, 13 Mar 2023 16:10:10 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/stateful-stream-processing-with-memphis-and-apache-spark-35pn</link>
      <guid>https://community.ops.io/memphis_dev/stateful-stream-processing-with-memphis-and-apache-spark-35pn</guid>
      <description>&lt;p&gt;Amazon Simple Storage Service (S3) is a highly scalable, durable, and secure object storage service offered by Amazon Web Services (AWS). S3 allows businesses to store and retrieve any amount of data from anywhere on the web by making use of its enterprise-level services. S3 is designed to be highly interoperable and integrates seamlessly with other Amazon Web Services (AWS) and third-party tools and technologies to process data stored in Amazon S3. One of which is Amazon EMR (Elastic MapReduce) which allows you to process large amounts of data using open-source tools such as Spark.&lt;/p&gt;

&lt;p&gt;Apache Spark is an open-source distributed computing system used for large-scale data processing. Spark is built to enable speed and supports various data sources, including the Amazon S3. Spark provides an efficient way to process large amounts of data and perform complex computations in minimal time.&lt;/p&gt;

&lt;p&gt;Memphis.dev is a next-generation alternative to traditional message brokers.&lt;br&gt;
A simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables cost-effective, fast, and reliable development of modern queue-based use cases.&lt;/p&gt;

&lt;p&gt;The common pattern among message brokers is to delete messages once they pass the defined retention policy, such as time, size, or number of messages. Memphis offers a 2nd storage tier for longer, possibly infinite retention of stored messages. Each message expelled from the station automatically migrates to the 2nd storage tier, which in this case is AWS S3.&lt;/p&gt;
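&lt;p&gt;The tiered-retention idea can be modeled in a few lines of Python. This is a simplified illustration of the concept, not Memphis’ actual implementation: messages older than the retention period move from primary storage to a second tier standing in for the S3 bucket.&lt;/p&gt;

```python
# Simplified model of tiered retention: messages older than the station's
# retention period are moved from the broker's primary storage to a second
# tier (standing in for S3); newer messages stay in primary storage.

def apply_retention(primary, second_tier, now, retention_seconds):
    kept = []
    for ts, msg in primary:
        if now - ts > retention_seconds:
            second_tier.append((ts, msg))   # offload expired message
        else:
            kept.append((ts, msg))
    return kept

s3_tier = []
station = [(100, "a"), (170, "b"), (195, "c")]   # (timestamp, payload)
station = apply_retention(station, s3_tier, now=200, retention_seconds=60)
```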

&lt;p&gt;In this tutorial, you will be guided through the process of setting up a Memphis station with a 2nd storage class connected to AWS S3 and an environment on AWS, followed by creating an S3 bucket, setting up an EMR cluster, installing and configuring Apache Spark on the cluster, preparing data in S3 for processing, processing data with Apache Spark, best practices, and performance tuning.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting up the Environment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memphis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;To get started, first &lt;a href="https://docs.memphis.dev/"&gt;install&lt;/a&gt; Memphis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enable AWS S3 integration via the Memphis &lt;a href="https://docs.memphis.dev/memphis/dashboard-gui/integrations/storage/amazon-s3"&gt;integration center&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/NeOn-bD4jksrJYbgWQb9DFfyBjrKft5CQpniKmILGYU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMDF0/a2s0cHdhcDNrMmZ4/Y2ZvdDAucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/NeOn-bD4jksrJYbgWQb9DFfyBjrKft5CQpniKmILGYU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMDF0/a2s0cHdhcDNrMmZ4/Y2ZvdDAucG5n" alt="Amazon S3" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a station (topic), and choose a retention policy.
Each message passing the configured retention policy will be offloaded to an S3 bucket.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/qHKz6S6GON1jtK-SONB1bBzKKdZxbtn7exDMQ9lazU4/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvc3h3/b3dmd29kdzZ4eXlh/OTNscWEucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/qHKz6S6GON1jtK-SONB1bBzKKdZxbtn7exDMQ9lazU4/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvc3h3/b3dmd29kdzZ4eXlh/OTNscWEucG5n" alt="create new station" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check the newly configured AWS S3 integration as 2nd storage class by clicking “Connect”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/ZXmccJB3Eny5kVnmqXRWem5iZprCiaDOve7DhnmvhvU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMXRs/ajFmcDBndzRscHVq/cDN1bTcucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/ZXmccJB3Eny5kVnmqXRWem5iZprCiaDOve7DhnmvhvU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMXRs/ajFmcDBndzRscHVq/cDN1bTcucG5n" alt="Integration" width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start producing events into your newly created Memphis station.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Create an AWS S3 Bucket
&lt;/h2&gt;

&lt;p&gt;If you haven't done so already, first you need to create an AWS account at &lt;a href="https://aws.amazon.com/"&gt;https://aws.amazon.com/&lt;/a&gt;. Next, create an S3 bucket where you can store your data. You can use the AWS Management Console, the AWS CLI, or an SDK to create a bucket. For this tutorial, you will use the AWS management console at &lt;a href="https://console.aws.amazon.com/s3/"&gt;https://console.aws.amazon.com/s3/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Click on "Create bucket".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/YhWGXWQ29j37mGYVBSZxZvGgbilq8sLP_7SwHzJWacw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdGJi/ODZ6NjJpN2Rwc280/OGp0OWwuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/YhWGXWQ29j37mGYVBSZxZvGgbilq8sLP_7SwHzJWacw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdGJi/ODZ6NjJpN2Rwc280/OGp0OWwuanBn" alt="Amazon S3" width="800" height="415"&gt;&lt;/a&gt;&lt;br&gt;
Then proceed to create a bucket name complying with the naming convention and choose the region where you want the bucket to be located. Configure the “Object ownership” and “Block all public access” to your use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/ldNMHRT8jVK4vfx4pNaxG3WdQDkLnZhJqKviIJ0JKaw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvcW5t/NnRhajJzYTQycWZs/MnRoeDQuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/ldNMHRT8jVK4vfx4pNaxG3WdQDkLnZhJqKviIJ0JKaw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvcW5t/NnRhajJzYTQycWZs/MnRoeDQuanBn" alt="Image12" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make sure to configure other bucket permissions to allow your Spark application to access the data. Finally, click on the “Create bucket” button to create the bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/rUCaSr2ApjxeE54R2vvs-TjabzvrHmZniiWi0Q3favk/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdzE0/dXc2NXE5MThnandm/NWFzY2cuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/rUCaSr2ApjxeE54R2vvs-TjabzvrHmZniiWi0Q3favk/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvdzE0/dXc2NXE5MThnandm/NWFzY2cuanBn" alt="Image description" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting up an EMR Cluster with Spark installed
&lt;/h2&gt;

&lt;p&gt;Amazon Elastic MapReduce (EMR) is a web service based on Apache Hadoop that allows users to cost-effectively process vast amounts of data using big data technologies including Apache Spark. To create an EMR cluster with Spark installed, open the EMR console at &lt;a href="https://console.aws.amazon.com/emr/"&gt;https://console.aws.amazon.com/emr/&lt;/a&gt; and select "Clusters" under "EMR on EC2" on the left side of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/NiD5XiA58xcEcF6FpoXrHmZPOZiiAOmGyAJsjeSZfvU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvam42/ZGN2cXhoZDI5cWEz/aGowM3EuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/NiD5XiA58xcEcF6FpoXrHmZPOZiiAOmGyAJsjeSZfvU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvam42/ZGN2cXhoZDI5cWEz/aGowM3EuanBn" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on "Create cluster" and give the cluster a descriptive name.&lt;br&gt;
Under "Application bundle", select Spark to install it on your cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/xq-aVbK56mEM5ykZRkkzOcyKdZuJrWolVPVO_JSzkOE/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZXl1/OXJvb3JpZnoxejQ4/YWNidXkuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/xq-aVbK56mEM5ykZRkkzOcyKdZuJrWolVPVO_JSzkOE/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZXl1/OXJvb3JpZnoxejQ4/YWNidXkuanBn" alt="Image description" width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scroll down to the "Cluster logs" section and select the “Publish cluster-specific logs to Amazon S3” checkbox.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/IBcxhJk68HQ8hFCuV1Ep4NRCbF8EGf08BEuvZK1dAS4/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvc2c0/aWYycHAxbXgwb3N1/NG5hYnQuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/IBcxhJk68HQ8hFCuV1Ep4NRCbF8EGf08BEuvZK1dAS4/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvc2c0/aWYycHAxbXgwb3N1/NG5hYnQuanBn" alt="Image description" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will open a prompt to enter the Amazon S3 location. Use the S3 bucket name you created in the previous step followed by /logs, i.e., s3://myawsbucket/logs. The /logs suffix is required so that Amazon EMR can create a new folder in your bucket into which it copies your cluster’s log files.&lt;/p&gt;

&lt;p&gt;Go to the “Security configuration and permissions section” and input your EC2 key pair or go with the option to create one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/FP-1PVMWwt-8Ur-OfRe_40Ky0vu3tnWHs8uzA-CxzLw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMWhi/dGp5ejVwNjAzeHlh/ejFuNXkuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/FP-1PVMWwt-8Ur-OfRe_40Ky0vu3tnWHs8uzA-CxzLw/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvMWhi/dGp5ejVwNjAzeHlh/ejFuNXkuanBn" alt="Image description" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click on the dropdown options for “Service role for Amazon EMR” and choose AWSServiceRoleForSupport. Choose the same dropdown option for “IAM role for instance profile”. Refresh the icon if needed to see these dropdown options.&lt;/p&gt;

&lt;p&gt;Finally, click the “Create cluster” button to launch the cluster and monitor the cluster status to validate that it’s been created.&lt;/p&gt;


&lt;h2&gt;
  
  
  Installing and configuring Apache Spark on EMR Cluster
&lt;/h2&gt;

&lt;p&gt;After successfully creating an EMR cluster, the next step is to configure Apache Spark on the EMR cluster. EMR clusters provide a managed environment for running Spark applications on AWS infrastructure, making it easy to launch and manage Spark clusters in the cloud. You configure Spark to work with your data and processing needs and then submit Spark jobs to the cluster to process your data.&lt;/p&gt;

&lt;p&gt;You can configure Apache Spark on the cluster over the Secure Shell (SSH) protocol. But first, you need to authorize SSH connections to your cluster, whose security settings were set by default when you created the EMR cluster. A guide on how to authorize SSH connections can be found here.&lt;/p&gt;

&lt;p&gt;To create an SSH connection, you need to specify the EC2 key pair that you selected when creating the cluster. Then connect to the EMR cluster using the Spark shell by first connecting to the primary node. To fetch the master public DNS of the primary node, navigate to the left of the AWS console, under EMR on EC2, choose Clusters, and then select the cluster whose public DNS name you want to get.&lt;/p&gt;

&lt;p&gt;On your OS terminal, input the following command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ssh hadoop@ec2-###-##-##-###.compute-1.amazonaws.com -i ~/mykeypair.pem&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Replace the ec2-###-##-##-###.compute-1.amazonaws.com with the name of your master public DNS and the ~/mykeypair.pem with the file and path name of your .pem file (follow this guide to get the .pem file). A prompt message will pop up, to which your response should be yes. Type in exit to close the SSH connection.&lt;/p&gt;
&lt;h2&gt;
  
  
  Preparing Data for Processing with Spark and uploading to S3 Bucket
&lt;/h2&gt;

&lt;p&gt;Data processing requires preparation before uploading to present the data in a format that Spark can easily process. The format used is influenced by the type of data you have and the analysis you plan to perform. Some formats used include CSV, JSON, and Parquet.&lt;/p&gt;
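&lt;p&gt;For example, preparing data as CSV can be as simple as writing consistent headers and one record per row, the shape that spark.read.csv() expects. The column names and values below are made up for illustration, and only the Python standard library is used.&lt;/p&gt;

```python
# Sketch of preparing data as CSV before uploading: a header row plus one
# record per row, with consistent columns throughout.

import csv, io

rows = [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Osaka"}]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()   # ready to save and upload to S3
```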

&lt;p&gt;Create a new Spark session and load your data into Spark using the relevant API. For instance, use the spark.read.csv() method to read CSV files into a Spark DataFrame. &lt;/p&gt;

&lt;p&gt;Amazon EMR, a managed service for Hadoop ecosystem clusters, can be used to process data. It reduces the need to set up, tune, and maintain clusters. It also features other integrations with &lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon SageMaker&lt;/a&gt;, for example, to start a SageMaker model training job from a Spark pipeline in Amazon EMR.&lt;/p&gt;

&lt;p&gt;Once your data is ready, you can write a Spark DataFrame to the Amazon S3 bucket using the DataFrame.write API. You should have configured your AWS credentials and have write permissions to access the S3 bucket.&lt;/p&gt;

&lt;p&gt;Indicate the S3 bucket and path where you want to save the data. For example, you can use &lt;code&gt;df.write.format("csv").save("s3a://my-bucket/path/to/data")&lt;/code&gt; to save the data to the specified S3 bucket (the s3a:// scheme tells Spark to write to S3 through the Hadoop S3 connector).&lt;/p&gt;


&lt;p&gt;Once the data is saved to the S3 bucket, you can access it from other Spark applications or tools, or you can download it for further analysis or processing. To upload to the bucket, create a folder: choose the bucket you initially created, click the Actions button, and select “Create Folder” from the drop-down items. You can now name the new folder.&lt;/p&gt;

&lt;p&gt;To upload the data files to the bucket, select the name of the data folder.&lt;/p&gt;

&lt;p&gt;In the “Upload – Select Files” wizard, choose “Add Files”.&lt;/p&gt;

&lt;p&gt;Proceed with the Amazon S3 console direction to upload the files and select “Start Upload”.&lt;/p&gt;
&lt;p&gt;It’s important to consider and follow best practices for securing your data before uploading it to the S3 bucket.&lt;/p&gt;
&lt;h2&gt;
  
  
  Understanding Data Formats and Schemas
&lt;/h2&gt;

&lt;p&gt;Data formats and schemas are two related but distinct and important concepts in data management. Data format refers to the organization and structure of data within the database. There are various formats to store data, i.e., CSV, JSON, XML, YAML, etc. These formats define how data should be structured, alongside the different types of data and applications applicable to it. A data schema, in contrast, is the structure of the database itself: it defines the layout of the database and ensures that data is stored appropriately. A database schema specifies the views, tables, indexes, types, and other elements. Both concepts are important in analytics and in the visualization of the database.&lt;/p&gt;
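&lt;p&gt;The distinction can be shown in miniature: the same record serialized in two different formats, next to a toy schema describing the fields and types the data must conform to (the field names here are invented for illustration).&lt;/p&gt;

```python
# Format vs schema in miniature: the same record can be serialized in
# different formats (JSON vs CSV below), while a schema is a separate
# description of the fields and types the data must conform to.

import json

record = {"id": 7, "name": "sensor-a"}

as_json = json.dumps(record, sort_keys=True)          # one format...
as_csv = f'{record["id"]},{record["name"]}'           # ...another format

schema = {"id": int, "name": str}                     # a (toy) schema

def conforms(rec, schema):
    # same field names, and every value has the declared type
    return set(rec) == set(schema) and all(
        isinstance(rec[k], t) for k, t in schema.items())
```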


&lt;h2&gt;
  
  
  Cleaning and Preprocessing Data in S3
&lt;/h2&gt;

&lt;p&gt;It is essential to double-check for errors in your data before processing it. To get started, access the data folder where you saved the data file in your S3 bucket, and download it to your local machine. Next, load the data into the data processing tool that will be used to clean and preprocess the data. For this tutorial, the preprocessing tool used is Amazon Athena, which helps to analyze unstructured and structured data stored in Amazon S3.&lt;/p&gt;

&lt;p&gt;Go to the Amazon Athena in AWS Console.&lt;/p&gt;

&lt;p&gt;Click on “Create” to create a new table and then “CREATE TABLE”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/A0jFMmBXZB23bxU9HFp8Wkgl6fOjMTREvboEsdxWLYU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbzB3/NmE3a3p3Mzc1YXJh/MHd3bGIuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/A0jFMmBXZB23bxU9HFp8Wkgl6fOjMTREvboEsdxWLYU/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbzB3/NmE3a3p3Mzc1YXJh/MHd3bGIuanBn" alt="Image description" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Type in the path of your data file in the part highlighted as LOCATION. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/Cpr1s49p7NtQzl2zxT_1AqPiYp7BDNM6ZW3FmyrhCjY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM2dq/bzgwcGR3amVnZ2N2/cW0yZzQuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/Cpr1s49p7NtQzl2zxT_1AqPiYp7BDNM6ZW3FmyrhCjY/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvM2dq/bzgwcGR3amVnZ2N2/cW0yZzQuanBn" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow the prompts to define the schema for the data and save the table. Now you can run a query to validate that the data is loaded correctly, and then clean and preprocess the data.&lt;br&gt;
An example:&lt;br&gt;
This query identifies the duplicates present in the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT row1, row2, COUNT(*)
FROM table
GROUP row, row2
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example creates a new table without the duplicates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE new_table AS
SELECT DISTINCT *
FROM table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
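&lt;p&gt;The same deduplication idea as the queries above, sketched in plain Python for comparison: count rows to find duplicates, then build a distinct collection.&lt;/p&gt;

```python
# Deduplication in plain Python, mirroring the SQL above: Counter finds
# rows appearing more than once; dict.fromkeys keeps the first occurrence
# of each row in order.

from collections import Counter

rows = [("a", 1), ("b", 2), ("a", 1)]

duplicates = [row for row, n in Counter(rows).items() if n > 1]
distinct_rows = list(dict.fromkeys(rows))   # preserves first-seen order
```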



&lt;p&gt;Finally, export the cleaned data back to S3 by navigating to the S3 bucket and the folder to upload the file.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Spark Framework
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source, general-purpose cluster computing framework built for fast, large-scale data processing. It is written primarily in Scala, runs on the JVM, and offers APIs in Scala, Java, Python, and R. The core feature of Spark is its in-memory data computing ability, which speeds up the processing of large datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuring Spark to work with S3
&lt;/h2&gt;

&lt;p&gt;To configure Spark to work with S3, begin by adding the Hadoop AWS dependency to your Spark application. Do this by adding the following line to your build file (e.g. build.sbt for Scala, or the equivalent dependency in pom.xml for Java):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Input the AWS access key ID and secret access key in your Spark application by setting the following configuration properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.hadoop.fs.s3a.access.key &amp;lt;ACCESS_KEY_ID&amp;gt;
spark.hadoop.fs.s3a.secret.key &amp;lt;SECRET_ACCESS_KEY&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, set the same properties using the SparkConf object in your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.access.key", "&amp;lt;ACCESS_KEY_ID&amp;gt;")
  .set("spark.hadoop.fs.s3a.secret.key", "&amp;lt;SECRET_ACCESS_KEY&amp;gt;")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the S3 endpoint URL in your Spark application by setting the following configuration property:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.hadoop.fs.s3a.endpoint s3.&amp;lt;REGION&amp;gt;.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;strong&gt;&amp;lt;REGION&amp;gt;&lt;/strong&gt; with the AWS region where your S3 bucket is located (e.g. us-east-1).&lt;br&gt;
The S3 client in Hadoop uses virtual-host-style addressing by default, which requires a DNS-compatible bucket name. If your bucket name contains dots or underscores, you may need to enable path-style access instead. Set the following configuration property to enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.hadoop.fs.s3a.path.style.access true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, create a Spark session with the S3 configuration by setting the &lt;strong&gt;spark.hadoop&lt;/strong&gt; prefix in the Spark configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val spark = SparkSession.builder()
  .appName("MyApp")
  .config("spark.hadoop.fs.s3a.access.key", "&amp;lt;ACCESS_KEY_ID&amp;gt;")
  .config("spark.hadoop.fs.s3a.secret.key", "&amp;lt;SECRET_ACCESS_KEY&amp;gt;")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.&amp;lt;REGION&amp;gt;.amazonaws.com")
  .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;strong&gt;&amp;lt;ACCESS_KEY_ID&amp;gt;&lt;/strong&gt;, &lt;strong&gt;&amp;lt;SECRET_ACCESS_KEY&amp;gt;&lt;/strong&gt;, and &lt;strong&gt;&amp;lt;REGION&amp;gt;&lt;/strong&gt; with your AWS credentials and S3 region.&lt;/p&gt;

&lt;p&gt;To read data from S3 in Spark, use the &lt;strong&gt;spark.read&lt;/strong&gt; method and specify the S3 path to your data as the input source.&lt;/p&gt;

&lt;p&gt;An example code demonstrating how to read a CSV file from S3 into a DataFrame in Spark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val spark = SparkSession.builder()
  .appName("ReadDataFromS3")
  .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val df = spark.read
  .option("header", "true") // Specify whether the first line is the header or not
  .option("inferSchema", "true") // Infer the schema automatically
  .csv("s3a://&amp;lt;BUCKET_NAME&amp;gt;/&amp;lt;FILE_PATH&amp;gt;")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, replace &lt;strong&gt;&amp;lt;BUCKET_NAME&amp;gt;&lt;/strong&gt; with the name of your S3 bucket and &lt;strong&gt;&amp;lt;FILE_PATH&amp;gt;&lt;/strong&gt; with the path to your CSV file within the bucket.&lt;/p&gt;




&lt;h2&gt;
  
  
  Transforming Data with Spark
&lt;/h2&gt;

&lt;p&gt;Transforming data with Spark typically refers to operations on data to clean, filter, aggregate, and join data. Spark makes available a rich set of APIs for data transformation, they include DataFrame, Dataset, and RDD APIs. Some of the common data transformation operations in Spark include filtering, selecting columns, aggregating data, joining data, and sorting data.&lt;/p&gt;

&lt;p&gt;Here’s one example of data transformation operations:&lt;/p&gt;

&lt;p&gt;Sorting data: This operation orders rows based on one or more columns, using the &lt;strong&gt;orderBy&lt;/strong&gt; or &lt;strong&gt;sort&lt;/strong&gt; method on a DataFrame or Dataset. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val sortedData = df.orderBy(col("age").desc)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
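
&lt;p&gt;The other operations listed above follow the same DataFrame API pattern. A brief sketch (assuming a DataFrame &lt;strong&gt;df&lt;/strong&gt; with &lt;strong&gt;name&lt;/strong&gt;, &lt;strong&gt;age&lt;/strong&gt;, and &lt;strong&gt;gender&lt;/strong&gt; columns; the column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Filtering: keep only the rows matching a predicate
val adults = df.filter(col("age") &amp;gt;= 18)

// Selecting columns
val namesAndAges = df.select("name", "age")

// Aggregating: average age per gender
val avgAgeByGender = df.groupBy("gender").agg(avg("age"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;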



&lt;p&gt;Finally, you may need to write the results back to S3 for storage.&lt;/p&gt;

&lt;p&gt;Spark provides various APIs to write data to S3, such as DataFrameWriter, DatasetWriter, and RDD.saveAsTextFile.&lt;/p&gt;

&lt;p&gt;The following is a code example demonstrating how to write a DataFrame to S3 in Parquet format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;val outputS3Path = "s3a://&amp;lt;BUCKET_NAME&amp;gt;/&amp;lt;OUTPUT_DIRECTORY&amp;gt;"

df.write
  .mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet(outputS3Path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;strong&gt;&amp;lt;BUCKET_NAME&amp;gt;&lt;/strong&gt; with the name of your S3 bucket, and &lt;strong&gt;&amp;lt;OUTPUT_DIRECTORY&amp;gt;&lt;/strong&gt; with the path to the output directory in the bucket.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;mode&lt;/strong&gt; method specifies the write mode, which can be &lt;strong&gt;Overwrite&lt;/strong&gt;, &lt;strong&gt;Append&lt;/strong&gt;, &lt;strong&gt;Ignore&lt;/strong&gt;, or &lt;strong&gt;ErrorIfExists&lt;/strong&gt;. The &lt;strong&gt;option&lt;/strong&gt; method can be used to specify various options for the output format, such as compression codec.&lt;/p&gt;

&lt;p&gt;You can also write data to S3 in other formats, such as CSV, JSON, and Avro, by changing the output format and specifying the appropriate options.&lt;/p&gt;
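
&lt;p&gt;For instance, the same DataFrame could be written as CSV instead (a sketch; the path placeholders follow the Parquet example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.write
  .mode(SaveMode.Overwrite)
  .option("header", "true") // write a header row
  .csv("s3a://&amp;lt;BUCKET_NAME&amp;gt;/&amp;lt;OUTPUT_DIRECTORY&amp;gt;")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;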




&lt;h2&gt;
  
  
  Understanding Data Partitioning in Spark
&lt;/h2&gt;

&lt;p&gt;In simple terms, data partitioning in Spark refers to splitting a dataset into smaller, more manageable portions distributed across the cluster. The purpose is to optimize performance, improve scalability, and ultimately make the data easier to manage. Spark processes data in parallel across the nodes of a cluster. This is made possible by Resilient Distributed Datasets (RDDs), Spark’s abstraction for large, distributed collections of data. By default, an RDD is partitioned across various nodes due to its size.&lt;/p&gt;
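
&lt;p&gt;You can inspect and control partitioning directly from your code. A minimal sketch (the partition counts are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Check how many partitions Spark chose by default
println(df.rdd.getNumPartitions)

// Redistribute the data across 100 partitions (triggers a full shuffle)
val repartitioned = df.repartition(100)

// Reduce to 10 partitions without a full shuffle
val coalesced = repartitioned.coalesce(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;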

&lt;p&gt;To get the best performance, Spark can be configured so that jobs execute promptly and resources are managed effectively. Techniques include caching, memory management, data serialization, and the use of mapPartitions() over map().&lt;/p&gt;
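
&lt;p&gt;As a brief sketch of two of these techniques: caching a DataFrame that several actions reuse, and using mapPartitions() so that expensive setup runs once per partition instead of once per record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Cache a DataFrame that several later actions will reuse
df.cache()

// mapPartitions runs the function once per partition, so costly
// setup (e.g. opening a connection) is amortized across its rows
val processedNames = df.rdd.mapPartitions { rows =&amp;gt;
  // per-partition setup would go here
  rows.map(row =&amp;gt; row.getAs[String]("name"))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;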

&lt;p&gt;Spark UI is a web-based graphical user interface that provides comprehensive information about a Spark application’s performance and resource usage. It includes several pages, such as Overview, Executors, Stages, and Tasks, that cover various aspects of a Spark job. Spark UI is an essential tool for monitoring and debugging Spark applications, as it helps identify performance bottlenecks and resource constraints and troubleshoot errors. By examining metrics such as the number of completed tasks, job duration, CPU and memory usage, and shuffle data written and read, users can optimize their Spark jobs and ensure they run efficiently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In summary, processing your data on AWS S3 using Apache Spark is an effective and scalable way to analyze huge datasets. By utilizing the cloud-based storage and computing resources of AWS S3 and Apache Spark, users can process their data quickly and effectively without having to worry about infrastructure management.&lt;/p&gt;

&lt;p&gt;In this tutorial, we went through setting up an S3 bucket and Apache Spark cluster on AWS EMR, configuring Spark to work with AWS S3, and writing and running Spark applications to process data. We also covered data partitioning in Spark, Spark UI, and optimizing performance in Spark.&lt;/p&gt;

&lt;p&gt;Reference:&lt;br&gt;
For more depth on configuring Spark for optimal performance, look &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui.html"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html"&gt;https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/"&gt;Memphis.dev&lt;/a&gt; By Sveta Gimpelson, Co-Founder &amp;amp; VP of Data &amp;amp; Research at @Memphis.dev&lt;/p&gt;

&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apachespark</category>
      <category>memphisdev</category>
      <category>streamprocessing</category>
    </item>
    <item>
      <title>Stateful stream processing with Memphis and Apache Iceberg</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Thu, 09 Mar 2023 13:16:01 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/stateful-stream-processing-with-memphis-and-apache-iceberg-37a2</link>
      <guid>https://community.ops.io/memphis_dev/stateful-stream-processing-with-memphis-and-apache-iceberg-37a2</guid>
      <description>&lt;p&gt;Amazon Web Services S3 (Simple Storage Service) is a fully managed cloud storage service designed to store and access any amount of data anywhere. It is an object-based storage system that enables data storage and retrieval while providing various features such as data security, high availability, and easy access. Its scalability, durability, and security make it popular with businesses of all sizes.&lt;/p&gt;

&lt;p&gt;Apache Iceberg is an open-source tabular format for data warehousing that enables efficient and scalable data processing on cloud object stores, including AWS S3. It is designed to provide efficient query performance and optimize data storage while supporting ACID transactions and data versioning. The Iceberg format is optimized for cloud object storage, enabling fast query processing while minimizing storage costs.&lt;/p&gt;

&lt;p&gt;Memphis is a next-generation alternative to traditional message brokers.&lt;br&gt;
A simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables cost-effective, fast, and reliable development of modern queue-based use cases.&lt;/p&gt;

&lt;p&gt;The common pattern of message brokers is to delete messages after they pass the defined retention policy, such as time, size, or number of messages. Memphis offers a 2nd storage tier for longer, possibly infinite retention of stored messages. Each message expelled from the station will automatically migrate to the 2nd storage tier, which in this case is AWS S3.&lt;br&gt;
More can be found &lt;a href="https://docs.memphis.dev/memphis/memphis/concepts/storage-and-redundancy#storage-tiering"&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  AWS S3 features
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Scalability: AWS S3 is highly scalable and can store and retrieve any data, from a few gigabytes to petabytes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Durability: S3 is designed to provide high durability, ensuring your data is always available and secure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security: S3 offers various security features, such as encryption and access control, so you can protect your data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accessibility: S3 is designed for easy access, making it easy to store and access your data from anywhere in the world.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost efficient: S3 is designed to be a cost-effective solution with usage-based pricing and no upfront costs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The purpose of processing data using Apache Iceberg is to optimize query performance and storage efficiency for large-scale data sets, while also providing a range of features to help manage and analyze data in the cloud. Here are some of the key benefits of using Apache Iceberg for data processing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Efficient query performance: Apache Iceberg is designed to provide efficient query performance for large amounts of data by using partitioning and indexing to read only the data needed for a particular query. This enables faster and more accurate data processing, even for huge amounts of data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data versioning: Apache Iceberg supports data versioning, so you can store and manage multiple versions of your data in the same table. This allows you to access historical data at any time and easily track changes over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ACID transactions: Apache Iceberg supports ACID transactions to ensure data consistency and accuracy at all times. This is especially important when working with mission-critical data, as it ensures that your data is always reliable and up-to-date.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimized data storage: Apache Iceberg optimizes data storage by only reading and writing the data needed for a given query. This helps to minimize storage costs and ensures that you’re only paying for the data that you’re actually using.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility: Apache Iceberg supports schema evolution and works with multiple processing engines, such as Spark, Flink, and Trino, so you can change your tables and tooling over time without rewriting your data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Overall, the purpose of processing data with Apache Iceberg is to provide a more efficient and reliable solution for managing and processing large amounts of data in the cloud. With Apache Iceberg, you can optimize query performance, minimize storage costs, and ensure data consistency and freshness at all times.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting up Memphis
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;To get started, first &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;install&lt;/a&gt; Memphis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enable AWS S3 integration via the Memphis &lt;a href="https://docs.memphis.dev/memphis/dashboard-gui/integrations/storage/amazon-s3"&gt;integration center.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/Z2my_U_KsYp7DLuwfpHXpVqg3ogoLKtVSBs1_VWCqps/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva2Z3/MXNuNzB6MnhjMWh5/bWs0cWoucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/Z2my_U_KsYp7DLuwfpHXpVqg3ogoLKtVSBs1_VWCqps/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva2Z3/MXNuNzB6MnhjMWh5/bWs0cWoucG5n" alt="setting up memphis" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a station (topic), and choose a retention policy.
Each message passing the configured retention policy will be offloaded to an S3 bucket.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/PGPTxhTSOZ98U8N_vY7h2aqMIiseAGukqZuy1eojf7A/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva3dw/dWt1MGJlNHliY3h3/MmZ0YmMucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/PGPTxhTSOZ98U8N_vY7h2aqMIiseAGukqZuy1eojf7A/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva3dw/dWt1MGJlNHliY3h3/MmZ0YmMucG5n" alt="create station" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check the newly configured AWS S3 integration as 2nd storage class by clicking “Connect”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start producing events into your newly created Memphis station.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Setting up AWS S3 and Apache Iceberg
&lt;/h2&gt;

&lt;p&gt;To get started, you’ll need an AWS S3 account. Creating one is a simple process. Here are the steps to follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the AWS homepage (&lt;a href="https://aws.amazon.com/"&gt;https://aws.amazon.com/&lt;/a&gt;) and click on the “Sign In to the Console” button in the top right corner.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/9ZeLviTSQFHhoLMsym1_F-0cIGYkNFNsgYv7deVN-Zg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvamZj/ajNuOXgweXY1NDM0/YWo1c3gucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/9ZeLviTSQFHhoLMsym1_F-0cIGYkNFNsgYv7deVN-Zg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvamZj/ajNuOXgweXY1NDM0/YWo1c3gucG5n" alt="aws" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If you already have an AWS account, enter your login details and click “Sign In”. If you don’t have an AWS account, click “Create a new AWS account” and follow the instructions to create a new account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once you’re logged into the AWS console, click on the “Services” dropdown menu in the top left corner and select “S3” from the “Storage” section.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/V0ZIGt-2qkyJAzhtR5Tp_5E-did-R_4uhL6MK2WULKE/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvYWVx/ZGdtaXR6M3ZoenZ5/dTI4dGgucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/V0ZIGt-2qkyJAzhtR5Tp_5E-did-R_4uhL6MK2WULKE/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvYWVx/ZGdtaXR6M3ZoenZ5/dTI4dGgucG5n" alt="aws services" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You’ll be taken to the S3 dashboard, where you can create and manage your S3 buckets. To create a new bucket, click the “Create bucket” button.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/tS9LeUmUVHU3x7p8TptxU82n1FgLx8wVK6cvMfLyYLo/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvaDB5/aDJhYjZ2eHJlYWJ5/bTd6M2cucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/tS9LeUmUVHU3x7p8TptxU82n1FgLx8wVK6cvMfLyYLo/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvaDB5/aDJhYjZ2eHJlYWJ5/bTd6M2cucG5n" alt="aws buckets" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Enter a unique name for your bucket (bucket names must be unique across all of AWS) and select the region where you want to store your data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can choose to configure additional settings for your bucket, such as versioning, encryption, and access control. Once you’ve configured your settings, click “Create bucket” to create your new bucket.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/jA_JTR9G9wdwA4pMA89nmjl97sYG0FdvknJwpeMt-v8/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvaTd2/OGkxdndiaWd3ZHR1/bDd3aXQuanBn" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/jA_JTR9G9wdwA4pMA89nmjl97sYG0FdvknJwpeMt-v8/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvaTd2/OGkxdndiaWd3ZHR1/bDd3aXQuanBn" alt="aws s3" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s it! You’ve now created an AWS S3 account and your first S3 bucket, which you can use to store and manage your data in the cloud. Next, make sure Apache Iceberg is installed on your system. You can download Apache Iceberg from the official website, or install it using Apache Maven or Gradle.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://community.ops.io/images/SBWD3gzkY0HEz9KL0u_-_llcjKUm7KG5rw_E5c30zTI/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbXFv/OWYxdzN1YnlkaXl5/Mnp4aWkucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/SBWD3gzkY0HEz9KL0u_-_llcjKUm7KG5rw_E5c30zTI/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbXFv/OWYxdzN1YnlkaXl5/Mnp4aWkucG5n" alt="apache iceberg" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have Apache Iceberg installed, create an AWS S3 bucket where you can store your data. You can do this from the AWS S3 web console, or with the AWS CLI by running &lt;strong&gt;aws s3 mb s3://bucket-name&lt;/strong&gt;, replacing &lt;strong&gt;bucket-name&lt;/strong&gt; with the name of your bucket.&lt;/p&gt;
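
&lt;p&gt;For example (the bucket name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the bucket
aws s3 mb s3://my-iceberg-data

# Verify that the bucket now exists
aws s3 ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;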

&lt;p&gt;After creating the bucket, you can create a table using the Iceberg Java API or the Iceberg CLI. Here’s an example of how to create a table using the Java API:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, you need to add the Iceberg library to your project. You can do this by adding the following dependency to your build file (e.g. Maven, Gradle):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;dependency&amp;gt;
  &amp;lt;groupId&amp;gt;org.apache.iceberg&amp;lt;/groupId&amp;gt;
  &amp;lt;artifactId&amp;gt;iceberg-core&amp;lt;/artifactId&amp;gt;
  &amp;lt;version&amp;gt;0.11.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Create a Schema object that defines the columns and data types for your table:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schema schema = new Schema(
    required(1, "id", Types.IntegerType.get()),
    required(2, "name", Types.StringType.get()),
    required(3, "age", Types.IntegerType.get()),
    required(4, "gender", Types.StringType.get())
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, the schema defines four columns: id (an integer), name (a string), age (an integer), and gender (a string).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create a PartitionSpec object that defines how your data will be partitioned. This is optional, but it can improve query performance by allowing you to only read the data that’s relevant to a given query. Here’s an example:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PartitionSpec partitionSpec = PartitionSpec.builderFor(schema)
 .identity("gender")
 .bucket("age", 10)
 .build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we’re partitioning the data by gender and age. We’re using bucketing to group ages into 10 buckets, which will make queries for specific age ranges faster.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Table object that represents your table. You’ll need to specify the name of your table and the location where the data will be stored (in this example, we’re using an S3 bucket):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Table table = new HadoopTables(new Configuration())
 .create(schema, partitionSpec, "s3://my-bucket/my-table");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will create a new table with the specified schema and partitioning in the S3 bucket. You can now start adding data to the table and running queries against it.&lt;/p&gt;

&lt;p&gt;Alternatively, you can use the Iceberg CLI to create a table. Here’s an example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open a terminal window and navigate to the directory where you want to create your table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run the following command to create a new table with the specified schema and partitioning:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iceberg table create \
 --schema "id:int,name:string,age:int,gender:string" \
 --partition-spec "gender:identity,age:bucket[10]" \
 s3://my-bucket/my-table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will create a new table with the specified schema and partitioning in the S3 bucket. You can now start adding data to the table and running queries against it. That’s how you can create a table using Apache Iceberg.&lt;/p&gt;


&lt;h2&gt;
  
  
  Converting data to String or JSON
&lt;/h2&gt;

&lt;p&gt;Converting data to strings or JSON is a common task when processing data with Apache Iceberg. This is useful for various reasons: to prepare data for downstream applications, to export data to other systems, or simply to make it easier for humans to read. The method is as follows:&lt;/p&gt;

&lt;p&gt;Identify the data to transform. Before converting data to string or JSON, you need to identify the data to convert. This can be the entire table, a subset of the table, or a single row. Once the data is identified, it can be converted to strings or JSON using the Apache Iceberg API. Convert data to string or JSON using Apache Iceberg API.&lt;/p&gt;

&lt;p&gt;Expressions and row classes can be used to convert data to strings or JSON in Apache Iceberg. Here’s an example of how to convert a row to a JSON string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row row = table.newScan().limit(1).asRow().next();
String json = JsonUtil.toJson(row, table.schema());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example scans the table and returns the first row as a Row object, then uses the JsonUtil class and the table’s schema to convert the row to a JSON string. You can use the same approach to convert a single row to a JSON string, or multiple rows to an array of JSON objects.&lt;/p&gt;

&lt;p&gt;Here is an example of converting a table to a CSV string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;String csv = new CsvWriter(table.schema()).writeToString(table.newScan());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example uses the CsvWriter class to convert the entire table to a CSV string. The CsvWriter class takes the table’s schema as a parameter and lets you specify additional options such as delimiters and quoting.&lt;br&gt;
Finally, save the transformed data to AWS S3. After converting the data to a string or JSON, you can write it to AWS S3 using the HadoopFileIO class. Here’s an example of how to save a JSON string to an S3 bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;byte[] jsonBytes = json.getBytes(StandardCharsets.UTF_8);
HadoopFileIO fileIO = new HadoopFileIO(new Configuration());
try (OutputStream out = fileIO.create(new Path("s3://my-bucket/my-file.json"))) {
    out.write(jsonBytes);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example converts a JSON string to a byte array, creates a new HadoopFileIO object, and writes the byte array to an S3 file. You can use this approach to store any type of transformed data (CSV, TSV, etc.) in S3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Flattening Schema
&lt;/h2&gt;

&lt;p&gt;Schema flattening is the process of converting a nested schema into a flat schema with all columns at the same level. This helps facilitate data querying and analysis. To flatten the schema using Apache Iceberg:&lt;/p&gt;

&lt;p&gt;Before flattening a schema, you must identify the schema to flatten. This can be a schema you created yourself, or one that is part of a larger dataset you are extracting and analyzing. Once you have identified it, you can flatten it using the Iceberg Java API. Here’s an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schema nestedSchema = new Schema(
  required(1, "id", Types.LongType.get()),
  required(2, "name", Types.StringType.get()),
  required(3, "address", Types.StructType.of(
    required(4, "street", Types.StringType.get()),
    required(5, "city", Types.StringType.get()),
    required(6, "state", Types.StringType.get()),
    required(7, "zip", Types.StringType.get())
  ))
);

Schema flattenedSchema = new Schema(
  required(1, "id", Types.LongType.get()),
  required(2, "name", Types.StringType.get()),
  required(3, "address_street", Types.StringType.get()),
  required(4, "address_city", Types.StringType.get()),
  required(5, "address_state", Types.StringType.get()),
  required(6, "address_zip", Types.StringType.get())
);

Transform flatten = new Transform(
  OperationType.FLATTEN,
  ImmutableMap.of(
    "address.street", "address_street",
    "address.city", "address_city",
    "address.state", "address_state",
    "address.zip", "address_zip"
  )
);

Schema newSchema = new Schema(flattenedSchema.columns());
newSchema = newSchema.updateMetadata(
  IcebergSchemaUtil.TRANSFORMS_PROP,
  TransformUtil.toTransformList(flatten).toString()
);

Table table = new HadoopTables(conf).create(
  newSchema, PartitionSpec.unpartitioned(), "s3://my-bucket/my-table"
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we have a nested schema with a name and address field. We want to flatten the address field into separate columns for street, city, state, and zip.&lt;/p&gt;

&lt;p&gt;To do this, we first create a new schema that represents the flattened schema. We then create a Transform that specifies how to flatten the original schema. In this case, we’re using the FLATTEN operation to create new columns with the specified names. We then create a new schema that includes the flattened columns and metadata that specifies the transformation that was applied. Once you’ve flattened the schema, you can save it to AWS S3 using the Table object that you created. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table.updateSchema()
  .commit();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will save the flattened schema to the S3 bucket that you specified when you created the table. That’s how you can flatten a schema using Apache Iceberg and save it to AWS S3.&lt;/p&gt;




&lt;p&gt;In conclusion, processing and managing large amounts of data in AWS S3 can be challenging, especially when dealing with nested schemas and complex queries. Apache Iceberg provides a powerful and efficient solution to these challenges, giving users a scalable and cost-effective way to process and query large amounts of data. This tutorial showed how to use Apache Iceberg on AWS S3 to process and manage data. We’ve seen how to create tables, convert data to strings or JSON, and simplify schemas to make data more accessible. Armed with this knowledge, you can now use Apache Iceberg to process and manage large amounts of data on AWS S3.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://memphis.dev/blog/stateful-stream-processing-with-memphis-and-apache-iceberg/"&gt;Memphis.dev&lt;/a&gt; By &lt;a href="https://www.linkedin.com/in/idan-asulin/"&gt;Idan Asulin&lt;/a&gt;, Co-Founder &amp;amp; CTO at Memphis.dev&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>apacheiceberg</category>
      <category>memphisdev</category>
      <category>streamprocessing</category>
    </item>
    <item>
      <title>Comparing Top 3 Schema Management Tools</title>
      <dc:creator>Avital Trifsik</dc:creator>
      <pubDate>Wed, 01 Mar 2023 13:38:38 +0000</pubDate>
      <link>https://community.ops.io/memphis_dev/comparing-top-3-schema-management-tools-3cj6</link>
      <guid>https://community.ops.io/memphis_dev/comparing-top-3-schema-management-tools-3cj6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction to schemas
&lt;/h2&gt;

&lt;p&gt;Before diving into the different supporting technologies, let’s establish a baseline about schemas and message brokers, or asynchronous server-to-server communication.&lt;/p&gt;

&lt;p&gt;Schema = Struct.&lt;/p&gt;

&lt;p&gt;A schema defines the shape and format of a “message” that is built and delivered between different applications/services/electronic entities.&lt;/p&gt;

&lt;p&gt;Schemas can be found in SQL &amp;amp; NoSQL databases, in the different shapes of the data the database expects to receive (for example, first_name:string vs. first.name, etc.).&lt;/p&gt;

&lt;p&gt;An unfamiliar or noncompliant schema will result in a dropped record: the database will not save it.&lt;/p&gt;

&lt;p&gt;Schemas can also be found when two logical entities are communicating, for example, two microservices.&lt;/p&gt;

&lt;p&gt;Imagine A writes a message to B, which expects a specific format (like Protobuf) and whose logic or code also expects specific keys and value types. An unexpected schema or a different format (for example, a typo in a column name) will result in a consumer crash.&lt;/p&gt;

&lt;p&gt;Schemas are a manual or automatic contract for stable communication that dictates how two entities should communicate.&lt;br&gt;
The following compared technologies will help you maintain and enforce schemas between services as data flows from one service to another.&lt;/p&gt;
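
&lt;p&gt;The contract idea above can be sketched in a few lines of Python (a conceptual illustration, not any particular library’s API): a message is checked against an expected schema before it is sent, so a typo’d key or a wrong value type is caught at the producer instead of crashing the consumer.&lt;/p&gt;

```python
# A minimal schema-as-contract sketch: a schema maps field names to expected types.
SCHEMA = {"first_name": str, "last_name": str, "age": int}

def validate(message: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the message complies."""
    errors = []
    for field, expected_type in schema.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            errors.append(f"bad type for {field}: {type(message[field]).__name__}")
    for field in message:
        if field not in schema:
            errors.append(f"unknown field: {field}")  # e.g. a typo in a column name
    return errors

ok = {"first_name": "Ada", "last_name": "Lovelace", "age": 36}
typo = {"frist_name": "Ada", "last_name": "Lovelace", "age": 36}

print(validate(ok, SCHEMA))    # []
print(validate(typo, SCHEMA))  # ['missing field: first_name', 'unknown field: frist_name']
```

&lt;p&gt;The tools compared below provide exactly this kind of gatekeeping, but centrally managed, versioned, and enforced across every producer and consumer.&lt;/p&gt;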




&lt;h2&gt;
  
  
  What is AWS Glue?
&lt;/h2&gt;

&lt;p&gt;AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/hvlJTvUOu61OVgQ23AFIx-MrTzg_NmMHDyny1OCouQs/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvaGp5/cnR2aDM0OTcyMjl0/bTE5YmMucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/hvlJTvUOu61OVgQ23AFIx-MrTzg_NmMHDyny1OCouQs/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvaGp5/cnR2aDM0OTcyMjl0/bTE5YmMucG5n" alt="AWS Glue" width="800" height="484"&gt;&lt;/a&gt;&lt;a href="https://aws.amazon.com/glue/"&gt;Credit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data integration engine&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Event-driven ETL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No-code ETL jobs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data preparation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main components of AWS Glue are the Data Catalog, which stores metadata, and an ETL engine that can automatically generate Scala or Python code. Common data sources include Amazon S3, RDS, and Aurora.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Confluent Schema Registry?
&lt;/h2&gt;

&lt;p&gt;Confluent Schema Registry provides a serving layer for your metadata.&lt;br&gt;
It provides a RESTful interface for storing and retrieving your Avro®, JSON Schema, and &lt;a href="https://protobuf.dev/"&gt;Protobuf&lt;/a&gt; &lt;a href="https://docs.confluent.io/platform/current/schema-registry/schema_registry_onprem_tutorial.html#schema-definition"&gt;schemas&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows the evolution of schemas according to the configured compatibility settings and expanded support for these schema types.&lt;/p&gt;

&lt;p&gt;It provides serializers that plug into Apache Kafka® clients that handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/xL2p3IawoLonMOO8scj17XSTPlXwhSjZV1NVmX8E5vA/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva2V6/bGdmYTV6amhzejhx/dWZ6eGcucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/xL2p3IawoLonMOO8scj17XSTPlXwhSjZV1NVmX8E5vA/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMva2V6/bGdmYTV6amhzejhx/dWZ6eGcucG5n" alt="Schema Registry" width="800" height="452"&gt;&lt;/a&gt;&lt;a href="https://docs.confluent.io/platform/current/schema-registry/index.html#schemas-subjects-and-topic"&gt;Credit &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Schema Registry lives outside of and separately from your Kafka brokers.&lt;br&gt;
Your producers and consumers still talk to Kafka to publish and read data (messages) to topics.&lt;/p&gt;

&lt;p&gt;Concurrently, they can also talk to Schema Registry to send and retrieve schemas that describe the data models for the messages.&lt;/p&gt;
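
&lt;p&gt;As a concrete illustration of that RESTful interface, the snippet below builds (but does not send) a schema-registration request in Python. The endpoint path, content type, and &lt;code&gt;"schema"&lt;/code&gt; payload key follow Schema Registry’s documented REST API; the host &lt;code&gt;localhost:8081&lt;/code&gt;, the &lt;code&gt;users&lt;/code&gt; topic, and the Avro schema itself are assumptions for the example.&lt;/p&gt;

```python
import json
import urllib.request

# An Avro schema is submitted as a JSON-escaped string under the "schema" key.
avro_schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "first_name", "type": "string"}],
}
payload = json.dumps({"schema": json.dumps(avro_schema)}).encode()

# POST /subjects/{subject}/versions registers a new schema version.
# With the default TopicNameStrategy, the subject for message values is "{topic}-value".
req = urllib.request.Request(
    "http://localhost:8081/subjects/users-value/versions",
    data=payload,
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    method="POST",
)
# urllib.request.urlopen(req) would return the assigned schema ID from a running registry.
print(req.get_method(), req.full_url)
```

&lt;p&gt;In practice you would rarely call the REST API by hand; the Confluent serializers register and fetch schemas for you.&lt;/p&gt;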




&lt;h2&gt;
  
  
  What is Memphis.dev Schemaverse?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management"&gt;Memphis Schemaverse&lt;/a&gt; provides a robust schema store and schema management layer on top of Memphis broker without a standalone compute unit or dedicated resources.&lt;/p&gt;

&lt;p&gt;With a unique &amp;amp; modern UI and programmatic approach, technical and non-technical users can create and define different schemas, attach a schema to multiple stations, and choose whether the schema should be enforced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memphis.dev/"&gt;Memphis&lt;/a&gt;’ low-code approach removes the serialization part as it is embedded within the producer library.&lt;/p&gt;

&lt;p&gt;Schemaverse supports versioning, GitOps methodologies, and schema evolution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/1vcaTtWBMO00T_XHNEjnfkX3AlvC-DIBo3-tSP2iuRs/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbzNj/MzdoY2I1dXkxMG5t/ZmhjaXYucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/1vcaTtWBMO00T_XHNEjnfkX3AlvC-DIBo3-tSP2iuRs/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvbzNj/MzdoY2I1dXkxMG5t/ZmhjaXYucG5n" alt="Schemaverse overview" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Schemaverse’s main purpose is to act as an automatic gatekeeper: it ensures the format and structure of messages ingested into a Memphis station, reducing the consumer crashes that often happen when a producer emits an event with an unfamiliar schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current version common use cases&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Schema enforcement between microservices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data contracts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Convert events’ format&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create an organizational standard around the different consumers and producers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://community.ops.io/images/gPGlCykxOaSFFKPDQhnr3u9hcDIcM6jmJbwarVChZKg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZzI1/ZW54MWlwYmx4eG5z/OWwwZW4ucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://community.ops.io/images/gPGlCykxOaSFFKPDQhnr3u9hcDIcM6jmJbwarVChZKg/w:800/mb:500000/ar:1/aHR0cHM6Ly9kZXYt/dG8tdXBsb2Fkcy5z/My5hbWF6b25hd3Mu/Y29tL3VwbG9hZHMv/YXJ0aWNsZXMvZzI1/ZW54MWlwYmx4eG5z/OWwwZW4ucG5n" alt="table comparison" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Validation and Enforcement&lt;/strong&gt;&lt;br&gt;
When data streaming applications are integrated with schema management, schemas used for data production are validated against schemas within a central registry, allowing you to centrally control data quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/schema-registry-gs.html"&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/a&gt; offers enforcement and validation using Glue schema registry for Java-based applications using Apache Kafka, AWS MSK, Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Schema Registry&lt;/strong&gt; validates and enforces message schemas on both the client and server sides. Validation takes place on the client side by retrieving the schema from the registry and serializing the about-to-be-produced data against it.&lt;br&gt;
Confluent provides ready-to-use serialization functions for this.&lt;/p&gt;

&lt;p&gt;Schema updates and evolution require restarting the client so that it fetches the changes. To change a schema at the registry level, the subject first has to be switched into a certain compatibility mode (forward/backward); then the change is performed, and finally the mode is switched back to the default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.memphis.dev/memphis/memphis/schemaverse-schema-management"&gt;Schemaverse&lt;/a&gt;&lt;/strong&gt; validates and enforces the schema at the client level as well without the need for manual schema fetch, and supports runtime evolution, meaning clients don’t need a reboot to apply new schema changes, including different data formats.&lt;/p&gt;

&lt;p&gt;Schemaverse also makes the serialization/deserialization transparent to the client and embeds it within the SDK based on the required data format.&lt;/p&gt;




&lt;h2&gt;
  
  
  Serialization/Deserialization
&lt;/h2&gt;

&lt;p&gt;When sending data over the network, it first needs to be encoded into bytes.&lt;br&gt;
AWS Glue and Schema Registry work similarly: each created schema has an ID.&lt;br&gt;
When the application producing data has registered its schema, the Schema Registry serializer validates that the record being produced is structured with fields and data types matching a registered schema.&lt;/p&gt;

&lt;p&gt;Deserialization follows a similar process: the needed schema is fetched based on the schema ID carried within the message.&lt;/p&gt;

&lt;p&gt;In AWS Glue and Schema Registry, it is the client’s responsibility to implement and handle the serialization, while in Schemaverse it is fully transparent: all the client needs to do is produce a message that complies with the required structure.&lt;/p&gt;
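
&lt;p&gt;To make the schema-ID mechanism concrete, here is a small Python sketch of Confluent’s documented wire format: a magic byte (0), a 4-byte big-endian schema ID, then the serialized payload. The payload here is plain JSON for simplicity; a real serializer would emit Avro or Protobuf bytes.&lt;/p&gt;

```python
import json
import struct

MAGIC_BYTE = 0

def encode(schema_id: int, record: dict) -> bytes:
    """Prefix the serialized record with the magic byte and schema ID."""
    payload = json.dumps(record).encode()  # stand-in for Avro/Protobuf bytes
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def decode(message: bytes) -> tuple:
    """Recover the schema ID (used to fetch the schema) and the record."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == MAGIC_BYTE, "unknown serialization format"
    return schema_id, json.loads(message[5:].decode())

msg = encode(42, {"first_name": "Ada"})
schema_id, record = decode(msg)
print(schema_id, record)  # 42 {'first_name': 'Ada'}
```

&lt;p&gt;The 5-byte prefix is what lets a consumer know which registered schema to fetch before deserializing the rest of the message.&lt;/p&gt;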




&lt;p&gt;&lt;a href="https://mailchi.mp/memphis.dev/newslettersub"&gt;Join 4500+ others and sign up for our data engineering newsletter.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Originally published at Memphis.dev By Yaniv Ben Hemo, Co-Founder &amp;amp; CEO at &lt;a href="https://memphis.dev/blog/"&gt;Memphis.dev&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Follow Us to get the latest updates!&lt;br&gt;
&lt;a href="https://github.com/memphisdev/memphis"&gt;Github&lt;/a&gt; • &lt;a href="https://docs.memphis.dev/memphis/getting-started/readme"&gt;Docs&lt;/a&gt; • &lt;a href="https://discord.com/invite/DfWFT7fzUu"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>schemaregistry</category>
      <category>schemaverse</category>
      <category>awsglue</category>
    </item>
  </channel>
</rss>
