PipeGen: Automated Efficient Data Piping
- PipeGen is an automated tool that constructs efficient data transfer pipes by repurposing DBMS export/import paths for parallel, binary streaming transfers.
- Its architecture employs IORedirect to reroute file I/O and FormOpt to optimize string serialization, achieving speedups up to 3.8×.
- PipeGen minimizes disk I/O and serialization overhead, enabling rapid data movement across local and remote clusters in hybrid analytics environments.
PipeGen is an automated tool for constructing efficient data transfer “pipes” between database management systems (DBMSs) in hybrid analytics environments, targeting shared-nothing architectures. PipeGen leverages the bulk export and import paths already present in analytic engines, typically CSV- or JSON-based text interfaces, and rewrites them to perform parallel, binary streaming transfers over sockets, bypassing disk I/O entirely and minimizing serialization overhead. By extending DBMS import/export code paths with minimal intervention, PipeGen enables rapid inter-DBMS data movement with significant performance benefits, supporting both local and remote cluster scenarios (Haynes et al., 2016).
1. Motivation and Problem Statement
Hybrid analytics workflows commonly require moving large intermediate datasets between disjoint DBMSs. Traditional approaches rely on text-based export (e.g., CSV on disk), incurring high disk I/O cost, serialization/deserialization overhead, and temporary storage pressure. A 100 GB relation exported/imported via CSV often demands tens of minutes and hundreds of gigabytes of scratch space.
Alternatives, such as adopting specialized binary formats (Parquet, Arrow), still require materialization on disk and lack zero-copy interoperation. Manually developing bespoke data pipes for each pair of systems is brittle, error-prone, and unsustainable as formats proliferate. PipeGen’s goals are to:
- Eliminate the disk I/O bottleneck in export–import chains.
- Systematically repurpose text-oriented export/import paths as specifications for the data pipe.
- Upgrade these paths to perform parallel, binary-efficient, streaming transfers.
- Limit code changes to concise stubs, preserving engine core logic.
2. Architecture and Design
PipeGen operates via a compile-time transformation with two principal components:
2.1 IORedirect
The IORedirect phase automatically locates all file I/O operations involved in bulk export/import—identified by instrumenting file-open system calls during unit tests and isolating filenames exclusive to these tests. These call sites are rewritten to instantiate DataPipeOutputStream or DataPipeInputStream, Java subclasses representing network socket streams. For example, a user-level command:
```
EXPORT TO 'db://target?workers=16' USING CSV;
```
triggers the network-based pipe, opening sockets to “target” rather than writing files. Parallel coordination across workers is managed by a lightweight “worker directory,” where exporters and importers dynamically register and are matched pairwise.
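The redirect decision hinges only on the target URI's scheme and its `workers` parameter. A minimal sketch of that dispatch logic, where `PipeTarget` and its method names are illustrative helpers, not PipeGen's actual API:

```java
// Sketch of IORedirect's dispatch: paths beginning with "db://" are
// routed to a socket-backed data pipe; all others fall back to files.
// PipeTarget is a hypothetical helper, not part of PipeGen itself.
public final class PipeTarget {
    // True when the export destination names a data pipe, not a file.
    static boolean isPipeUri(String path) {
        return path.startsWith("db://");
    }

    // Extract the requested degree of parallelism, defaulting to 1.
    static int workers(String uri) {
        int q = uri.indexOf('?');
        if (q < 0) return 1;
        for (String kv : uri.substring(q + 1).split("&")) {
            String[] parts = kv.split("=", 2);
            if (parts.length == 2 && parts[0].equals("workers")) {
                return Integer.parseInt(parts[1]);
            }
        }
        return 1;
    }
}
```

For `db://target?workers=16`, `isPipeUri` returns true and `workers` returns 16, so a rewritten call site would open socket streams for 16 workers rather than a `FileOutputStream`.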
2.2 FormOpt (Format Optimizer)
FormOpt analyzes unit tests covering export/import serialization, identifying string operations such as primitive serialization (e.g., Double.toString(v)) and parsing (Integer.parseInt(s)). It rewrites these call sites to use a custom string class, AString, which stores raw primitive values instead of their textual form. This optimization:
- Completely bypasses UTF-8 digit encoding and scanning.
- Eliminates delimiters (commas, tabs, field names) by inferring and suppressing redundant text separation.
- Pivots row-major dumps into column-wise blocks in memory, using Apache Arrow for columnar, zero-copy serialization.
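The effect of AString can be illustrated with a stripped-down stand-in that holds a raw long and emits it as eight binary bytes instead of decimal text. This is a sketch of the idea only: PipeGen's actual AString is a non-final String replacement covering all primitive types, and `RawLongString` is a hypothetical name.

```java
import java.nio.ByteBuffer;

// Minimal stand-in for PipeGen's AString: wraps a raw primitive so that
// serialization can skip decimal formatting and emit binary directly.
final class RawLongString {
    final long value;

    RawLongString(long value) { this.value = value; }

    // "Parsing" under FormOpt: no digit scanning, just wrap the raw bits.
    static RawLongString parseLong(long alreadyBinary) {
        return new RawLongString(alreadyBinary);
    }

    // Binary form: always 8 bytes, independent of the value's magnitude.
    byte[] toBytes() {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }
}
```

A value like 1234567890123 costs 13 bytes as decimal text plus a delimiter, but a fixed 8 bytes in binary form, and neither side pays for digit encoding or scanning.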
The combined effect is a shift from text-formatted, delimiter-separated output to raw binary column blocks, yielding 3–4× bandwidth improvements from reduced formatting work and the elimination of disk interaction.
3. Core Algorithms
PipeGen’s key transformations are encapsulated in pseudocode as follows:
3.1 File IO Redirection
```
for each engine under test:
    instrument every file open: record all system calls open(path)
    run export & import unit tests
    targetPaths := paths seen only during export/import tests
    for each call site open(p) in engine source:
        if p ∈ targetPaths:
            replace open(p) with:
                if p.startsWith("db://") then new DataPipeOutputStream(p)
                else new FileOutputStream(p)
```
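The set-difference step that isolates export/import-only paths can be expressed directly. A sketch, assuming per-test logs of opened paths are already collected; `PathIsolator` and `targetPaths` are illustrative names:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Identify file paths opened ONLY by export/import unit tests: gather the
// paths those tests touch, then subtract paths any other test also opens.
final class PathIsolator {
    static Set<String> targetPaths(Map<String, Set<String>> pathsByTest,
                                   Set<String> exportImportTests) {
        Set<String> candidates = new HashSet<>();
        Set<String> others = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : pathsByTest.entrySet()) {
            if (exportImportTests.contains(e.getKey())) {
                candidates.addAll(e.getValue());
            } else {
                others.addAll(e.getValue());
            }
        }
        candidates.removeAll(others);  // keep paths exclusive to export/import
        return candidates;
    }
}
```

Only call sites opening a surviving path are rewritten, which keeps unrelated file I/O (logs, temp files) untouched.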
Runtime socket creation uses central directory matching for worker coordination:
```java
class DataPipeOutputStream extends FileOutputStream {
    DataPipeOutputStream(String uri) {
        DirEntry e = WorkerDirectory.blockingQuery(uri, workerID);
        this.socket = new Socket(e.host, e.port);
        // wrap socket.getOutputStream()
    }
}
```
3.2 String Decoration (FormOpt mode)
```
for each unit test covering export/import:
    build a data-flow graph (DFG) of String ops leading into the stream
    for each expr e in DFG:
        if e == literal("…") or e == v.toString():
            replace with new AString(v)       // carries raw v
        if e == Integer.parseInt(s):
            replace with AString.parseInt(s)
```
At final export, PipeGen batches primitives into Arrow column buffers for efficient transfer.
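The row-to-column pivot that precedes this batching can be sketched with plain primitive arrays standing in for Arrow's column buffers (PipeGen uses real Arrow vectors; this shows only the pivot, and `EdgeColumns` is an illustrative name):

```java
// Pivot row-major edge records (src, dst) into per-column buffers,
// the column-wise layout that Arrow serializes as contiguous blocks.
final class EdgeColumns {
    final int[] src;
    final int[] dst;

    EdgeColumns(int[][] rows) {
        src = new int[rows.length];
        dst = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            src[i] = rows[i][0];
            dst[i] = rows[i][1];
        }
    }
}
```

Once pivoted, each column is a contiguous block of fixed-width values, which is what allows Arrow to ship it without per-value copying or formatting.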
4. Implementation and Supported Systems
PipeGen is implemented in Java and modifies classes such as FileOutputStream and FileInputStream, introducing a custom, non-final AString class that stands in for String. It also supports patches to StringBuilder, Hadoop’s Text, HDFS streaming classes, JDBC ResultSet.getString(), and Jackson’s JsonGenerator in “library-extension” mode.
Supported formats include:
- CSV (all engines)
- JSON (supported by Myria, Spark, Giraph, Hadoop)
- Parquet, Protobuf, and other binary formats via socket redirection if natively supported
The intermediate data representation is column-oriented Apache Arrow buffers enabling zero-copy semantics. Parallelism is maintained by strict 1:1 matching of exporters and importers through the worker directory, with EOF stubs assigned if importer count exceeds exporter count. Fault handling is implemented via compile-time verification proxies and dynamic, cluster-level debug modes comparing checksum outputs.
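The 1:1 matching rule, with EOF stubs for surplus importers, can be sketched as follows. Class and method names are illustrative, not PipeGen's worker-directory API, and the real directory matches workers dynamically rather than from complete lists:

```java
import java.util.ArrayList;
import java.util.List;

// Pair exporters with importers 1:1; an importer left without a partner
// receives an EOF stub so it terminates cleanly. Surplus exporters are
// simply left unmatched, mirroring PipeGen's current limitation.
final class PipeMatcher {
    static List<String> match(List<String> exporters, List<String> importers) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < importers.size(); i++) {
            if (i < exporters.size()) {
                pairs.add(exporters.get(i) + "->" + importers.get(i));
            } else {
                pairs.add("EOF->" + importers.get(i));  // surplus importer
            }
        }
        return pairs;
    }
}
```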
PipeGen has been successfully deployed to auto-generate pipes between five Java-based DBMSs: Myria, Spark 1.5.2, Giraph 1.0.0, Hadoop 2.7.1, and Derby 10.12, requiring only the execution of PipeGen scripts and a rebuild.
5. Experimental Evaluation
Evaluation utilized 16 Amazon EC2 m4.2xlarge nodes, with YARN running on 15 data nodes. Workloads transferred relations of varying cardinalities with diverse schemas, including weighted-edge lists for Giraph.
Measured metrics include end-to-end data-transfer time (seconds), speedup over disk-based CSV, scaling with number of workers, sensitivity to data-type composition, and breakdown of optimization gains.
Speedup Summary Table
| Source\Target | Myria | Spark | Giraph | Hadoop | Derby |
|---|---|---|---|---|---|
| Myria | — | 3.2× | 3.1× | 3.3× | 2.9× |
| Spark | 3.1× | — | 3.4× | 3.2× | 3.0× |
| ... |
Aggregate average: 3.2× faster; peak speedup: 3.8×.
Parallel Scaling (Myria→Spark)
| Workers | 1 | 4 | 8 | 16 |
|---|---|---|---|---|
| Speedup | 3.1 | 3.7 | 3.5 | 3.7 |
Data-Type Sensitivity
- Fixed-width numeric data: ~3.8× speedup
- Mixed strings: ~2.4× speedup
Optimization Breakdown (Myria↔Giraph)
- IORedirect only: ~1.7×
- Binary values: ~2.6×
- Delimiter removal: ~3.0×
- Column-pivot + Arrow: 3.4×
Intermediate Format Timing (Hadoop↔Spark)
| Format | Time (s) |
|---|---|
| Custom binary row-major | 100 |
| Protobuf (static) | 98 |
| Protobuf (dynamic) | 102 |
| Arrow row-major | 75 |
| Arrow column-major | 65 |
Compression Tests (Myria→Giraph)
- Uncompressed TCP: 22 s
- Run-Length Encoding (RLE): 24 s
- zlib: 20 s
- Shared-memory TCP: 28 s (local)
Code Modification Cost
| Engine | PipeGen Time | IORedirect LOC | FormOpt LOC |
|---|---|---|---|
| Hadoop | 245 s | 6 LOC, 3 cls | 36 LOC, 6 cls |
| Myria | 160 s | 8 LOC, 2 cls | 54 LOC, 5 cls |
| Giraph | 223 s | 9 LOC, 2 cls | 47 LOC, 4 cls |
| Spark | 187 s | 18 LOC, 5 cls | 38 LOC, 8 cls |
| Derby | 130 s | 5 LOC, 2 cls | 67 LOC, 2 cls |
6. Limitations and Future Directions
PipeGen operates under several constraints:
- Assumes prior agreement on schema layout (schema matching is out of scope).
- Relies on complete test coverage for export/import code paths—omitted logic disables FormOpt optimizations for correctness.
- Handles extra importers via EOF stubs; extra exporters are not yet supported.
- Optimizations for nested JSON are limited to top-level dictionaries.
- Prospective extensions include integration with federated planners, broadening to heterogeneous clusters and multi-data-center setups, adaptive real-time compression, support for non-Java engines, and applying the AString paradigm to binary file-based imports.
Periodic updates to PipeGen’s library extension allow supported engines to instantly benefit from advances in wire protocols (e.g., Apache Arrow release upgrades) without manual patching.
7. Significance in Hybrid Analytics
PipeGen establishes a robust solution for high-throughput, low-latency data shipping in multi-DBMS analytics pipelines, eliminating the scalability bottlenecks of disk I/O and format conversion. By automating the adaptation of DBMS export/import pathways to modern binary and streaming paradigms, PipeGen facilitates flexible, resource-efficient inter-system collaboration—demonstrating up to 3.8× transfer speedups with minimal manual engineering overhead. This framework provides foundational improvements for analytics practitioners requiring scalable, federated, and composable data workflows in heterogeneous processing environments (Haynes et al., 2016).