PipeGen: Automated Efficient Data Piping
- PipeGen is an automated tool that constructs efficient data transfer pipes by repurposing DBMS export/import paths for parallel, binary streaming transfers.
- Its architecture employs IORedirect to reroute file I/O and FormOpt to optimize string serialization, achieving speedups up to 3.8×.
- PipeGen minimizes disk I/O and serialization overhead, enabling rapid data movement across local and remote clusters in hybrid analytics environments.
PipeGen is an automated tool for constructing efficient data transfer “pipes” between database management systems (DBMSs) in hybrid analytics environments, targeting shared-nothing architectures. PipeGen leverages the bulk export and import paths already present in analytic engines, typically CSV- or JSON-based text interfaces, and rewrites them to perform parallel, binary streaming transfers over sockets, bypassing disk I/O entirely and minimizing serialization overhead. By extending DBMS import/export code paths with minimal intervention, PipeGen enables rapid inter-DBMS data movement with significant performance benefits, supporting both local and remote cluster scenarios (Haynes et al., 2016).
1. Motivation and Problem Statement
Hybrid analytics workflows commonly require moving large intermediate datasets between disjoint DBMSs. Traditional approaches rely on text-based export (e.g., CSV on disk), incurring high disk I/O cost, serialization/deserialization overhead, and temporary storage pressure. A 100 GB relation exported/imported via CSV often demands tens of minutes and hundreds of gigabytes of scratch space.
Alternatives, such as adopting specialized binary formats (Parquet, Arrow), still require materialization on disk and lack zero-copy interoperation. Manually developing bespoke data pipes for each pair of systems is brittle, error-prone, and unsustainable as formats proliferate. PipeGen’s goals are to:
- Eliminate the disk I/O bottleneck in export–import chains.
- Systematically repurpose text-oriented export/import paths as specifications for the data pipe.
- Upgrade these paths to perform parallel, binary-efficient, streaming transfers.
- Limit code changes to concise stubs, preserving engine core logic.
2. Architecture and Design
PipeGen operates via a compile-time transformation with two principal components:
2.1 IORedirect
The IORedirect phase automatically locates all file I/O operations involved in bulk export/import—identified by instrumenting file-open system calls during unit tests and isolating filenames exclusive to these tests. These call sites are rewritten to instantiate DataPipeOutputStream or DataPipeInputStream, Java subclasses representing network socket streams. For example, a user-level command:
```
EXPORT TO 'db://target?workers=16' USING CSV;
```
triggers the network-based pipe, opening sockets to “target” rather than writing files. Parallel coordination across workers is managed by a lightweight “worker directory,” where exporters and importers dynamically register and are matched pairwise.
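The redirect decision hinges only on the target URI's scheme and its `workers` parameter. A minimal sketch of that dispatch logic, where `PipeTarget` and its method names are illustrative helpers, not PipeGen's actual API:

```java
// Sketch of IORedirect's dispatch: paths beginning with "db://" are
// routed to a socket-backed data pipe; all others fall back to files.
// PipeTarget is a hypothetical helper, not part of PipeGen itself.
public final class PipeTarget {
    // True when the export destination names a data pipe, not a file.
    static boolean isPipeUri(String path) {
        return path.startsWith("db://");
    }

    // Extract the requested degree of parallelism, defaulting to 1.
    static int workers(String uri) {
        int q = uri.indexOf('?');
        if (q < 0) return 1;
        for (String kv : uri.substring(q + 1).split("&")) {
            String[] parts = kv.split("=", 2);
            if (parts.length == 2 && parts[0].equals("workers")) {
                return Integer.parseInt(parts[1]);
            }
        }
        return 1;
    }
}
```

For `db://target?workers=16`, `isPipeUri` returns true and `workers` returns 16, so a rewritten call site would open socket streams for 16 workers rather than a `FileOutputStream`.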
2.2 FormOpt (Format Optimizer)
FormOpt analyzes unit tests covering export/import serialization, identifying string operations such as primitive serialization (e.g., Double.toString(v)) and parsing (Integer.parseInt(s)). It rewrites these call sites to use a custom string class, AString, which stores raw primitive values instead of their textual form. This optimization:
- Completely bypasses UTF-8 digit encoding and scanning.
- Eliminates delimiters (commas, tabs, field names) by inferring and suppressing redundant text separation.
- Pivots row-major dumps into column-wise blocks in memory, using Apache Arrow for columnar, zero-copy serialization.
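The effect of AString can be illustrated with a stripped-down stand-in that holds a raw long and emits it as eight binary bytes instead of decimal text. This is a sketch of the idea only: PipeGen's actual AString is a non-final String replacement covering all primitive types, and `RawLongString` is a hypothetical name.

```java
import java.nio.ByteBuffer;

// Minimal stand-in for PipeGen's AString: wraps a raw primitive so that
// serialization can skip decimal formatting and emit binary directly.
final class RawLongString {
    final long value;

    RawLongString(long value) { this.value = value; }

    // "Parsing" under FormOpt: no digit scanning, just wrap the raw bits.
    static RawLongString parseLong(long alreadyBinary) {
        return new RawLongString(alreadyBinary);
    }

    // Binary form: always 8 bytes, independent of the value's magnitude.
    byte[] toBytes() {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }
}
```

A value like 1234567890123 costs 13 bytes as decimal text plus a delimiter, but a fixed 8 bytes in binary form, and neither side pays for digit encoding or scanning.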
The combined effect is a shift from text-formatted, delimiter-separated output to raw binary column blocks, yielding 3–4× bandwidth improvements from reduced formatting work and the elimination of disk interaction.
3. Core Algorithms
PipeGen’s key transformations are encapsulated in pseudocode as follows:
3.1 File IO Redirection
```
for each engine under test:
    instrument every file open: record all system calls open(path)
    run export & import unit tests
    targetPaths := paths seen only during export/import tests
    for each call site open(p) in engine source:
        if p ∈ targetPaths:
            replace open(p) with:
                if p.startsWith("db://") then new DataPipeOutputStream(p)
                else new FileOutputStream(p)
```
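The set-difference step that isolates export/import-only paths can be expressed directly. A sketch, assuming per-test logs of opened paths are already collected; `PathIsolator` and `targetPaths` are illustrative names:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Identify file paths opened ONLY by export/import unit tests: gather the
// paths those tests touch, then subtract paths any other test also opens.
final class PathIsolator {
    static Set<String> targetPaths(Map<String, Set<String>> pathsByTest,
                                   Set<String> exportImportTests) {
        Set<String> candidates = new HashSet<>();
        Set<String> others = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : pathsByTest.entrySet()) {
            if (exportImportTests.contains(e.getKey())) {
                candidates.addAll(e.getValue());
            } else {
                others.addAll(e.getValue());
            }
        }
        candidates.removeAll(others);  // keep paths exclusive to export/import
        return candidates;
    }
}
```

Only call sites opening a surviving path are rewritten, which keeps unrelated file I/O (logs, temp files) untouched.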
Runtime socket creation uses central directory matching for worker coordination:
```java
class DataPipeOutputStream extends FileOutputStream {
    DataPipeOutputStream(String uri) {
        DirEntry e = WorkerDirectory.blockingQuery(uri, workerID);
        this.socket = new Socket(e.host, e.port);
        // wrap socket.getOutputStream()
    }
}
```
3.2 String Decoration (FormOpt mode)
```
for each unit test covering export/import:
    build a data-flow graph (DFG) of String ops leading into the stream
    for each expr e in DFG:
        if e == literal("…") or e == v.toString():
            replace with new AString(v)       // carries raw v
        if e == Integer.parseInt(s):
            replace with AString.parseInt(s)
```
At final export, PipeGen batches primitives into Arrow column buffers for efficient transfer.
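The row-to-column pivot that precedes this batching can be sketched with plain primitive arrays standing in for Arrow's column buffers (PipeGen uses real Arrow vectors; this shows only the pivot, and `EdgeColumns` is an illustrative name):

```java
// Pivot row-major edge records (src, dst) into per-column buffers,
// the column-wise layout that Arrow serializes as contiguous blocks.
final class EdgeColumns {
    final int[] src;
    final int[] dst;

    EdgeColumns(int[][] rows) {
        src = new int[rows.length];
        dst = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            src[i] = rows[i][0];
            dst[i] = rows[i][1];
        }
    }
}
```

Once pivoted, each column is a contiguous block of fixed-width values, which is what allows Arrow to ship it without per-value copying or formatting.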
4. Implementation and Supported Systems
PipeGen is implemented in Java and modifies classes such as FileOutputStream and FileInputStream, introducing a custom, non-final AString class that stands in for String. It also supports patches to StringBuilder, Hadoop’s Text, HDFS streaming classes, JDBC ResultSet.getString(), and Jackson’s JsonGenerator in “library-extension” mode.
Supported formats include:
- CSV (all engines)
- JSON (supported by Myria, Spark, Giraph, Hadoop)
- Parquet, Protobuf, and other binary formats via socket redirection if natively supported
The intermediate data representation is column-oriented Apache Arrow buffers enabling zero-copy semantics. Parallelism is maintained by strict 1:1 matching of exporters and importers through the worker directory, with EOF stubs assigned if importer count exceeds exporter count. Fault handling is implemented via compile-time verification proxies and dynamic, cluster-level debug modes comparing checksum outputs.
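The 1:1 matching rule, with EOF stubs for surplus importers, can be sketched as follows. Class and method names are illustrative, not PipeGen's worker-directory API, and the real directory matches workers dynamically rather than from complete lists:

```java
import java.util.ArrayList;
import java.util.List;

// Pair exporters with importers 1:1; an importer left without a partner
// receives an EOF stub so it terminates cleanly. Surplus exporters are
// simply left unmatched, mirroring PipeGen's current limitation.
final class PipeMatcher {
    static List<String> match(List<String> exporters, List<String> importers) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < importers.size(); i++) {
            if (i < exporters.size()) {
                pairs.add(exporters.get(i) + "->" + importers.get(i));
            } else {
                pairs.add("EOF->" + importers.get(i));  // surplus importer
            }
        }
        return pairs;
    }
}
```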
PipeGen has been successfully deployed to auto-generate pipes between five Java-based DBMSs: Myria, Spark 1.5.2, Giraph 1.0.0, Hadoop 2.7.1, and Derby 10.12, requiring only the execution of PipeGen scripts and a rebuild.
5. Experimental Evaluation
Evaluation utilized 16 Amazon EC2 m4.2xlarge nodes, with YARN running on 15 data nodes. Workloads transferred relations of varying cardinalities with diverse schemas, including weighted-edge lists for Giraph.
Measured metrics include end-to-end data-transfer time (seconds), speedup over disk-based CSV, scaling with number of workers, sensitivity to data-type composition, and breakdown of optimization gains.
Speedup Summary Table
| Source\Target | Myria | Spark | Giraph | Hadoop | Derby |
|---|---|---|---|---|---|
| Myria | — | 3.2× | 3.1× | 3.3× | 2.9× |
| Spark | 3.1× | — | 3.4× | 3.2× | 3.0× |
| ... |
Aggregate average: 3.2× faster; peak speedup: 3.8×.
Parallel Scaling (Myria→Spark)
| Workers | 1 | 4 | 8 | 16 |
|---|---|---|---|---|
| Speedup | 3.1 | 3.7 | 3.5 | 3.7 |
Data-Type Sensitivity
- Fixed-width numeric data: ~3.8× speedup
- Mixed strings: ~2.4× speedup
Optimization Breakdown (Myria↔Giraph)
- IORedirect only: ~1.7×
- Binary values: ~2.6×
- Delimiter removal: ~3.0×
- Column-pivot + Arrow: 3.4×
Intermediate Format Timing (Hadoop↔Spark)
| Format | Time (s) |
|---|---|
| Custom binary row-major | 100 |
| Protobuf (static) | 98 |
| Protobuf (dynamic) | 102 |
| Arrow row-major | 75 |
| Arrow column-major | 65 |
Compression Tests (Myria→Giraph)
- Uncompressed TCP: 22 s
- Run-Length Encoding (RLE): 24 s
- zlib: 20 s
- Shared-memory TCP: 28 s (local)
Code Modification Cost
| Engine | PipeGen Time | IORedirect LOC | FormOpt LOC |
|---|---|---|---|
| Hadoop | 245 s | 6 LOC, 3 cls | 36 LOC, 6 cls |
| Myria | 160 s | 8 LOC, 2 cls | 54 LOC, 5 cls |
| Giraph | 223 s | 9 LOC, 2 cls | 47 LOC, 4 cls |
| Spark | 187 s | 18 LOC, 5 cls | 38 LOC, 8 cls |
| Derby | 130 s | 5 LOC, 2 cls | 67 LOC, 2 cls |
6. Limitations and Future Directions
PipeGen operates under several constraints:
- Assumes prior agreement on schema layout (schema matching is out of scope).
- Relies on complete test coverage for export/import code paths—omitted logic disables FormOpt optimizations for correctness.
- Handles extra importers via EOF stubs; extra exporters are not yet supported.
- Optimizations for nested JSON are limited to top-level dictionaries.
- Prospective extensions include integration with federated planners, broadening to heterogeneous clusters and multi-data-center setups, adaptive real-time compression, support for non-Java engines, and applying the AString paradigm to binary file-based imports.
Periodic updates to PipeGen’s library extension allow supported engines to instantly benefit from advances in wire protocols (e.g., Apache Arrow release upgrades) without manual patching.
7. Significance in Hybrid Analytics
PipeGen establishes a robust solution for high-throughput, low-latency data shipping in multi-DBMS analytics pipelines, eliminating the scalability bottlenecks of disk I/O and format conversion. By automating the adaptation of DBMS export/import pathways to modern binary and streaming paradigms, PipeGen facilitates flexible, resource-efficient inter-system collaboration—demonstrating up to 3.8× transfer speedups with minimal manual engineering overhead. This framework provides foundational improvements for analytics practitioners requiring scalable, federated, and composable data workflows in heterogeneous processing environments (Haynes et al., 2016).