Workloads #

For now, BenchPilot supports only the following containerized workloads:

marketing-campaign
A streaming distributed workload that features a data processing pipeline with multiple, diverse steps emulating insight extraction from marketing campaigns. The workload relies on technologies such as Kafka and Redis. Specific configuration parameters:
  • campaigns, the number of campaigns; the default is 1000.
  • tuples_per_second, the number of tuples emitted per second; the default is 10000.
  • kafka_event_count, the number of events generated and published on Kafka; the default is 1000000.
  • maximize_data, which automatically scales up the data that most critically affects the workload’s performance; the user provides a multiplier in the format x10, x100, etc.
mlperf
An inference workload that includes tasks such as image classification and object recognition. Specific configuration parameters:
  • dataset_folder, the dataset folder; the one we have been using for our experiments is imagenet2012.
  • model_file, the model file that MLPerf uses for inference, e.g. resnet50_v1.pb.
  • profile, the MLPerf profile, e.g. resnet50-tf.
  • data_volume_path, the path from which the data is volume-mounted; this avoids building huge workload images and makes configuration easier.
db
A NoSQL database workload that continuously executes CRUD operations (Create, Read, Update and Delete). Specific configuration parameters:
  • db, the database used as the underlying one; for now, only mongodb is supported.
  • threads, the number of threads used for executing the operations; the default is 1.
  • record_count, the number of records loaded into the database as the starting dataset; the default is 2500000.
  • operation_count, the number of operations executed during the workload; this affects how long the experiment takes to finish. The default is 2500000.
  • read_proportion, a float representing the proportion of read operations executed during the benchmark; the default is 0.5.
  • update_proportion, a float representing the proportion of update operations executed during the benchmark; the default is 0.5.
  • scan_proportion, the proportion of scan operations executed during the experiment; the default is 0.
  • insert_proportion, the proportion of insert operations executed; the default is 0.
  • request_distribution, the distribution of the data access pattern; the default is zipfian.
simple
A workload of simple stressors that target a specific resource. Under the hood it uses the Linux stress command or iPerf3. Specific configuration parameters:
  • service, which can be either iperf3 or stress.
  • options, the options exactly as you would pass them on the iperf3 or stress command line. For example, for iperf3 the options could be “-c” to connect to a server’s IP as a client, “-s” to run in server mode, and “-p” to set the targeted port. For stress, the options could be, for example, “--vm”: “12” and “--vm-bytes”: “1024M”.
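
As an illustration, the sketch below models the defaults listed above as plain Python dictionaries and merges them with user-supplied overrides. The helper and overall structure are hypothetical; only the parameter names and default values come from this page:

```python
# Hypothetical sketch: the documented per-workload defaults as plain Python
# dicts. The merge helper is illustrative only; BenchPilot's actual
# experiment-description format may differ.
WORKLOAD_DEFAULTS = {
    "marketing-campaign": {
        "campaigns": 1000,
        "tuples_per_second": 10000,
        "kafka_event_count": 1000000,
    },
    "db": {
        "db": "mongodb",
        "threads": 1,
        "record_count": 2500000,
        "operation_count": 2500000,
        "read_proportion": 0.5,
        "update_proportion": 0.5,
        "scan_proportion": 0.0,
        "insert_proportion": 0.0,
        "request_distribution": "zipfian",
    },
}

def build_config(workload: str, **overrides) -> dict:
    """Merge user-supplied overrides onto the documented defaults."""
    config = dict(WORKLOAD_DEFAULTS.get(workload, {}))
    config.update(overrides)
    return config

# Example: a read-heavy db experiment on 4 threads.
print(build_config("db", threads=4, read_proportion=0.9, update_proportion=0.1))
```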

It’s important to note that BenchPilot can be easily extended to add new workloads.

To extend BenchPilot with new workloads, check out this section.

Detailed Workload Information #

Streaming Analytic Workload / Marketing-Campaign #

For this workload, we have employed the widely known Yahoo Streaming Benchmark, which is designed to simulate a data processing pipeline for extracting insights from marketing campaigns. The pipeline executed on the edge device includes steps such as receiving advertising traffic data, filtering it, removing unnecessary values, joining the data with existing information from a key-value store, and storing the final results. All data produced by a data generator is pushed to and consumed from a message queue (Apache Kafka), while intermediate data and final results are stored in an in-memory database (Redis). This workload can be executed using any of the following distributed stream processing engines: Apache Storm, Flink, or Spark.
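
To make the pipeline steps concrete, here is a minimal, self-contained Python sketch of the per-event logic. The event schema, window size, and the in-memory stand-ins for Kafka and Redis are assumptions for illustration; in the real benchmark these steps run as operators inside the selected engine:

```python
import json
import time
from collections import defaultdict

# Stand-ins for illustration only: the real benchmark reads events from Kafka
# and uses Redis both for the campaign join and for the windowed results.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-B"}  # key-value store
WINDOW_COUNTS = defaultdict(int)                               # final results

def process_event(raw: str) -> None:
    event = json.loads(raw)                        # receive advertising traffic data
    if event.get("event_type") != "view":          # filter the data
        return
    slim = {k: event[k] for k in ("ad_id", "event_time")}  # remove unnecessary values
    campaign = AD_TO_CAMPAIGN.get(slim["ad_id"])           # join with the key-value store
    if campaign is None:
        return
    window = int(slim["event_time"]) // 10                 # 10-second window (assumed size)
    WINDOW_COUNTS[(campaign, window)] += 1                 # store the final results

process_event(json.dumps({"ad_id": "ad-1", "event_type": "view",
                          "event_time": int(time.time())}))
print(dict(WINDOW_COUNTS))
```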

Collected Performance Metrics #

To evaluate the performance of this application, we extract the following measurements from the benchmarking log files:

  • # of Tuples: the total number of tuples processed during execution.
  • Latency: the total application latency, measured in ms, based on the statistics provided by the selected underlying processing engine for each deployed task.

Distributed Processing Engine Parameters #

As mentioned, this workload can be executed using any of the following distributed stream processing engines: Apache Storm, Flink, or Spark.

For each of these engines, the user can alter or define the following attributes:

  • Storm: partitions, ackers
  • Flink: partitions, buffer_timeout, checkpoint_interval
  • Spark: partitions, batchtime, executor_cores, executor_memory
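
For illustration, a hypothetical set of engine-parameter blocks could look as follows; only the parameter names come from the list above, while the values and nesting are assumptions:

```python
# Hypothetical engine-parameter blocks; parameter names from the list above,
# values and structure assumed for illustration.
ENGINE_PARAMS = {
    "storm": {"partitions": 4, "ackers": 2},
    "flink": {"partitions": 4, "buffer_timeout": 100, "checkpoint_interval": 60000},
    "spark": {"partitions": 4, "batchtime": 2000,
              "executor_cores": 2, "executor_memory": "2g"},
}
```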

Machine Learning Inference Workload / MLPerf #

We use MLPerf, a benchmark for machine learning training and inference, to assess the performance of our inference system. Currently, our focus is on two MLPerf tasks:

  • Image Classification: This task uses the ImageNet 2012 dataset (resized to 224x224) and measures Top-1 accuracy. MLPerf provides two model options: ResNet-50 v1.5, which excels in image classification, and RetinaNet, which is effective in object detection and bounding box prediction.
  • Object Detection: This task identifies and classifies objects within images, locating them with bounding boxes. MLPerf uses two model configurations: a smaller, 300x300 model for low-resolution tasks (e.g., mobile devices) and a larger, high-resolution model (1.44 MP). Performance is measured by mean average precision (mAP). The SSD model with a ResNet-34 backbone is the default for this task.

Additionally, we have extended MLPerf by adding network-serving capabilities to measure the impact of network overhead on inference. Our setup includes:

  • A lightweight server that loads models and provides a RESTful API.
  • A workload generator that streams images to the server one-by-one (“streaming mode”), contrasting with MLPerf’s standard local loading (“default mode”).
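
The workload generator in streaming mode can be pictured as a loop that sends one image per request to the serving endpoint. The sketch below uses only Python's standard library; the URL, route, and response format are illustrative assumptions, not BenchPilot's actual API:

```python
import glob
import json
import urllib.request

# Hypothetical endpoint: the actual URL, route, and response format of
# BenchPilot's lightweight server are not documented here.
SERVER_URL = "http://localhost:8080/predict"

def stream_images(folder: str) -> None:
    """Send images to the inference server one at a time ('streaming mode')."""
    for path in sorted(glob.glob(f"{folder}/*.jpg")):
        with open(path, "rb") as f:
            payload = f.read()
        req = urllib.request.Request(
            SERVER_URL, data=payload,
            headers={"Content-Type": "application/octet-stream"})
        with urllib.request.urlopen(req) as resp:
            print(path, json.loads(resp.read()))

# stream_images("imagenet2012")  # e.g. the dataset_folder from the parameters above
```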

For this workload, it is possible to flexibly and easily configure the dataset, latency, batch size, workload duration, thread count, and inference framework (ONNX, NCNN, TensorFlow, or PyTorch).
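
Putting these parameters together, a hypothetical mlperf workload entry might look like the following; the key names are the documented parameters, while the surrounding structure and example volume path are assumptions:

```python
# Hypothetical mlperf workload entry; key names are the documented parameters,
# the structure and the volume path are assumed for illustration.
mlperf_workload = {
    "workload": "mlperf",
    "dataset_folder": "imagenet2012",
    "model_file": "resnet50_v1.pb",
    "profile": "resnet50-tf",
    "data_volume_path": "/data/imagenet",  # assumed host path
}
```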

Collected Performance Metrics #

To evaluate the performance of this application, we extract the following measurements from the benchmarking log files:

  • Accuracy %: the model's accuracy measured during the benchmarking period.
  • Average and/or Total Queries per Second: the number of queries executed during the experiment; each query represents the processing of a batch of images.
  • Mean Latency: the application's mean latency, measured in ms.

NoSQL Database Workload #

Through the Yahoo! Cloud Serving Benchmark (YCSB) workload, one can evaluate NoSQL databases like MongoDB, Redis, Cassandra, and Elasticsearch under heavy load. YCSB tests basic operations (read, update, and insert) on each database at defined operation rates across an experiment’s duration. Currently, BenchPilot supports only MongoDB as the underlying database; however, it can easily be adapted to the other databases by containerizing them.

Additionally, YCSB supports three workload distributions:

  • Zipfian: Prioritizes frequently accessed items.
  • Latest: Similar to Zipfian but focuses on recently inserted records.
  • Uniform: Accesses items randomly.
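
To illustrate how the request_distribution choice shapes key selection, here is a small self-contained sketch; the generators are deliberate simplifications of YCSB's actual ones:

```python
import random

def next_key(distribution: str, record_count: int, last_inserted: int) -> int:
    """Simplified key chooser mimicking YCSB's request distributions."""
    if distribution == "uniform":
        return random.randrange(record_count)      # every record equally likely
    if distribution == "zipfian":
        rank = int(random.paretovariate(1.16))     # heavy-tailed rank: small ranks dominate
        return min(rank - 1, record_count - 1)     # popular items get most accesses
    if distribution == "latest":
        rank = int(random.paretovariate(1.16))     # same skew, anchored at the newest record
        return max(last_inserted - (rank - 1), 0)
    raise ValueError(f"unknown distribution: {distribution}")

print([next_key("zipfian", 1000, 999) for _ in range(10)])
```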

For this benchmark, users can adjust various parameters, including the number of records, total operations, load distribution, operation rate, and experiment duration. It also supports multiple threads for increased database load through asynchronous operations.
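
Because the operation mix is expressed as proportions, a quick sanity check that they sum to 1.0 can catch misconfigured experiments early; a minimal sketch with a hypothetical helper:

```python
def check_proportions(read=0.5, update=0.5, scan=0.0, insert=0.0) -> None:
    """Hypothetical helper: the four operation proportions should sum to 1.0."""
    total = read + update + scan + insert
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"operation proportions sum to {total}, expected 1.0")

check_proportions(read=0.9, insert=0.1, update=0.0)  # a read-heavy mix
```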

Collected Performance Metrics #

To evaluate the performance of this application, we extract the following measurements from the log files:

  • Count: the total number of operations per second for each minute of the experiment.
  • Min: the minimum number of operations per second for each minute of the experiment.
  • Max: the maximum number of operations per second for each minute of the experiment.
  • Average: the average number of operations per second for each minute of the experiment.