From Legacy ETL to Cloud Data Fusion Pipelines: SQL vs. Idiomatic CDAP Approaches

April 6, 2026 · 15 min read · MigryX Team

Google Cloud Data Fusion is a fully managed, CDAP-based data integration service that lets teams build ETL/ELT pipelines visually or programmatically. For organizations migrating off Alteryx, Talend, IBM DataStage, Informatica, SSIS, or Oracle ODI, Data Fusion offers a natural landing zone — it’s a visual pipeline builder backed by a Dataproc Spark execution engine, with native BigQuery, GCS, Pub/Sub, and Cloud SQL connectors out of the box.

But the critical architectural question during migration is: should you push SQL to BigQuery through Data Fusion, or should you use Data Fusion's native CDAP plugins to express transformation logic? The answer depends on what the legacy ETL actually does, and MigryX's parsers make this decision automatically by classifying every transformation in the source system.

Two Approaches, One Pipeline

Cloud Data Fusion supports two fundamentally different patterns for data transformation, and understanding when to use each is key to producing pipelines that are both performant and maintainable.

Approach 1: SQL Pushdown to BigQuery

When legacy ETL logic is fundamentally SQL — SELECT statements with JOINs, GROUP BY, CASE expressions, window functions, MERGE for upserts — the most efficient Data Fusion pipeline pushes that SQL directly to BigQuery. Data Fusion becomes a thin orchestration and scheduling layer, while BigQuery's serverless engine does the actual computation.

This approach is ideal when migrating from SQL-heavy legacy platforms:

# Data Fusion pipeline JSON — BigQuery SQL Pushdown node
{
  "name": "BigQueryPushdown",
  "plugin": {
    "name": "BigQueryPushDown",
    "type": "batchsource",
    "artifact": {"name": "google-cloud", "scope": "SYSTEM"},
    "properties": {
      "project": "my-project",
      "dataset": "staging",
      "sql": "SELECT o.order_id, o.amount, c.name, c.region,
              CASE WHEN o.amount >= 10000 THEN 'enterprise'
                   WHEN o.amount >= 1000  THEN 'mid_market'
                   ELSE 'smb' END AS deal_tier,
              SUM(o.amount) OVER (PARTITION BY c.region) AS region_total
       FROM `staging.orders` o
       JOIN `ref.customers` c ON o.customer_id = c.customer_id
       WHERE o.order_date >= '2025-01-01'",
      "enableQueryPushdown": "true"
    }
  }
}
The SQL pushdown approach involves no data movement for transformations. BigQuery executes the SQL in its serverless engine, and Data Fusion only orchestrates the execution. That means no Dataproc cluster is needed for SQL-only stages, which dramatically reduces both cost and latency.
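
Because BigQuery performs all of the computation, a candidate pushdown statement (for example, the MERGE-based upserts mentioned earlier) can be validated outside the pipeline before it is embedded in a stage. Below is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are illustrative.

# Sketch: validating a pushdown MERGE statement directly against BigQuery
# before embedding it in a Data Fusion stage. All names are illustrative.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

merge_sql = """
MERGE `gold.customer_orders` AS tgt
USING `staging.orders` AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN
  UPDATE SET amount = src.amount, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, updated_at)
  VALUES (src.order_id, src.customer_id, src.amount, CURRENT_TIMESTAMP())
"""

job = client.query(merge_sql)  # runs entirely in BigQuery's serverless engine
job.result()                   # block until the DML statement completes
print(f"Rows affected: {job.num_dml_affected_rows}")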

Approach 2: Idiomatic CDAP Plugins

When legacy ETL involves visual dataflow logic that isn’t naturally expressed in SQL — multi-branch routing, row-level parsing, conditional splits, custom functions, complex type coercion, nested data flattening — MigryX generates native CDAP plugins that map 1:1 to the legacy transformation semantics.

This approach is ideal for visual ETL platforms:

CDAP Plugin Mapping from Legacy ETL

Each legacy transformation type maps to a specific Data Fusion plugin:

Legacy Transformation | Data Fusion Plugin | Notes
Alteryx Select / Filter | Wrangler (filter-rows directive) | Row filtering with expression support
Alteryx Formula | JavaScript Transform | Row-level calculations and derived columns
Alteryx Join / Append | Joiner plugin | Inner, outer, left, right join modes
Alteryx Summarize | GroupBy Aggregate | SUM, AVG, COUNT, MIN, MAX with group keys
Talend tMap | Joiner + JavaScript Transform | Lookup join + expression columns
Talend tNormalize | Wrangler (split-to-rows) | Delimiter-based row explosion
DataStage Transformer | JavaScript / Python Transform | Derivation expressions converted to JS/Python
DataStage Lookup | Joiner (broadcast mode) | Small reference table broadcast join
Informatica Router | Splitter plugin | Conditional routing to multiple output ports
Informatica Expression | Wrangler / JavaScript Transform | Row-level derived fields
Informatica Sequence Generator | JavaScript Transform | Counter logic in custom transform
SSIS Derived Column | Wrangler (set-column directive) | Expression-based column creation
SSIS Conditional Split | Splitter plugin | Condition-based row routing
ODI Interface (IKM SQL) | BigQuery Pushdown | SQL-based KMs go to pushdown mode
ODI Interface (IKM Procedure) | JavaScript / Python Transform | Procedural KMs become custom plugins
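
To make the mapping concrete, consider a DataStage Transformer derivation or an Informatica Expression from the table above landing in a Python Evaluator transform stage. The following is a rough sketch of what the generated script body can look like; field names, thresholds, and the filter rule are illustrative rather than real conversion output.

# Sketch: Python Evaluator transform body for a converted Transformer/Expression stage.
# Field names (amount, deal_tier) and thresholds are illustrative.
def transform(record, emitter, context):
    # Rows the legacy filter/constraint would have rejected are simply dropped
    if record.get('amount') is None or record['amount'] <= 0:
        return

    # Derived column, equivalent to the legacy derivation expression
    if record['amount'] >= 10000:
        record['deal_tier'] = 'enterprise'
    elif record['amount'] >= 1000:
        record['deal_tier'] = 'mid_market'
    else:
        record['deal_tier'] = 'smb'

    emitter.emit(record)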

Example: Alteryx Workflow to Data Fusion Pipeline

Consider an Alteryx workflow with an Input Data tool reading from CSV, a Formula tool adding derived columns, a Filter tool removing invalid rows, a Join tool enriching with a reference table, and a Summarize tool aggregating results before writing to a database output.

# Data Fusion pipeline JSON — Idiomatic CDAP conversion of Alteryx workflow
{
  "name": "alteryx_orders_pipeline",
  "description": "Converted from Alteryx workflow: daily_order_summary.yxmd",
  "stages": [
    {
      "name": "RawOrders",
      "plugin": {"name": "GCSFile", "type": "batchsource"},
      "properties": {
        "path": "gs://raw-data/orders/",
        "format": "csv",
        "schema": "{\"type\":\"record\",\"fields\":[...]}"
      }
    },
    {
      "name": "CleanAndDerive",
      "plugin": {"name": "Wrangler", "type": "transform"},
      "properties": {
        "directives": [
          "filter-rows-on condition-false amount > 0",
          "set-column deal_tier ifelse(amount >= 10000, 'enterprise', ifelse(amount >= 1000, 'mid_market', 'smb'))",
          "set-column order_month format-date(order_date, 'yyyy-MM')"
        ]
      }
    },
    {
      "name": "ProductLookup",
      "plugin": {"name": "Joiner", "type": "batchjoiner"},
      "properties": {
        "joinKeys": "CleanAndDerive.product_id = Products.product_id",
        "selectedFields": "CleanAndDerive.*, Products.product_name, Products.category",
        "requiredInputs": "CleanAndDerive"
      }
    },
    {
      "name": "RegionSummary",
      "plugin": {"name": "GroupByAggregate", "type": "batchaggregator"},
      "properties": {
        "groupByFields": "region, deal_tier, order_month",
        "aggregates": "total_amount:amount:Sum, order_count:order_id:Count, avg_amount:amount:Avg"
      }
    },
    {
      "name": "BigQuerySink",
      "plugin": {"name": "BigQueryTable", "type": "batchsink"},
      "properties": {
        "project": "my-project",
        "dataset": "gold",
        "table": "regional_order_summary",
        "operation": "INSERT",
        "truncateTable": "true"
      }
    }
  ],
  "connections": [
    {"from": "RawOrders", "to": "CleanAndDerive"},
    {"from": "CleanAndDerive", "to": "ProductLookup"},
    {"from": "ProductLookup", "to": "RegionSummary"},
    {"from": "RegionSummary", "to": "BigQuerySink"}
  ]
}

Informatica to Cloud Data Fusion migration — automated end-to-end by MigryX

Hybrid Pipelines: SQL + CDAP Together

Real-world legacy ETL rarely falls cleanly into one category. A single Informatica workflow might have a Source Qualifier with SQL overrides (SQL-expressible), followed by an Expression transformation (could be either), followed by a Router with complex conditions (CDAP plugin), writing to multiple targets with Update Strategy logic (SQL MERGE). MigryX handles this by generating hybrid pipelines.

# Hybrid pipeline — SQL pushdown for extraction, CDAP for routing, SQL for loading
{
  "stages": [
    {
      "name": "ExtractWithSQL",
      "plugin": {"name": "BigQueryPushDown", "type": "batchsource"},
      "properties": {
        "sql": "SELECT * FROM `staging.transactions` WHERE txn_date >= CURRENT_DATE() - 7"
      }
    },
    {
      "name": "ClassifyAndRoute",
      "plugin": {"name": "JavaScriptTransform", "type": "transform"},
      "properties": {
        "script": "function transform(input, emitter, context) {
          if (input.amount >= 50000) { emitter.emit(input, 'high_value'); }
          else if (input.amount >= 5000) { emitter.emit(input, 'standard'); }
          else { emitter.emit(input, 'micro'); }
        }",
        "outputPorts": "high_value,standard,micro"
      }
    },
    {
      "name": "HighValueSink",
      "plugin": {"name": "BigQueryTable", "type": "batchsink"},
      "properties": {
        "dataset": "gold", "table": "high_value_transactions",
        "operation": "UPSERT", "tableKey": "txn_id"
      }
    },
    {
      "name": "StandardSink",
      "plugin": {"name": "BigQueryTable", "type": "batchsink"},
      "properties": {
        "dataset": "gold", "table": "standard_transactions",
        "operation": "INSERT"
      }
    }
  ]
}
Hybrid pipelines are the most common output in real-world migrations. Pure SQL pipelines and pure CDAP pipelines are edge cases. MigryX's parser classifies each transformation independently, so a single legacy workflow can produce a pipeline with both SQL pushdown stages and CDAP plugin stages — optimizing for performance and fidelity at every step.

MigryX: Idiomatic Code, Not Line-by-Line Translation

The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.
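
As a schematic illustration of that difference (shown here in pandas purely for readability, not as actual MigryX output), the same derived-column rule can be written as a per-row loop or as a single vectorized expression; column names and thresholds are illustrative.

# Schematic contrast: row-by-row translation vs. idiomatic vectorized code.
# Column names and thresholds are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [12000.0, 2500.0, 400.0]})

# Literal, line-by-line style: iterate over every row
tiers = []
for _, row in df.iterrows():
    if row["amount"] >= 10000:
        tiers.append("enterprise")
    elif row["amount"] >= 1000:
        tiers.append("mid_market")
    else:
        tiers.append("smb")
df["deal_tier_loop"] = tiers

# Idiomatic, vectorized style: one expression over the whole column
df["deal_tier"] = np.select(
    [df["amount"] >= 10000, df["amount"] >= 1000],
    ["enterprise", "mid_market"],
    default="smb",
)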

Orchestration: Scheduling and Dependencies

Legacy ETL platforms bundle orchestration with transformation. Informatica workflows, Talend job groups, DataStage sequences, and SSIS packages all include scheduling and dependency management. In Data Fusion, pipelines are scheduled natively or orchestrated through Cloud Composer (Airflow) for complex multi-pipeline DAGs.

# Cloud Composer DAG orchestrating multiple Data Fusion pipelines
from airflow import DAG
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)
from datetime import datetime

dag = DAG(
    "daily_etl_orchestration",
    schedule_interval="0 6 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
)

extract_pipeline = CloudDataFusionStartPipelineOperator(
    task_id="extract_sources",
    pipeline_name="extract_all_sources",
    instance_name="prod-datafusion",
    location="us-central1",
    dag=dag,
)

transform_pipeline = CloudDataFusionStartPipelineOperator(
    task_id="transform_and_enrich",
    pipeline_name="transform_enrich_orders",
    instance_name="prod-datafusion",
    location="us-central1",
    dag=dag,
)

load_gold = CloudDataFusionStartPipelineOperator(
    task_id="load_gold_layer",
    pipeline_name="load_gold_tables",
    instance_name="prod-datafusion",
    location="us-central1",
    dag=dag,
)

extract_pipeline >> transform_pipeline >> load_gold

MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins

Platform-Specific Optimization by MigryX

MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.

How MigryX Decides: SQL or CDAP?

MigryX's parser analyzes each legacy transformation and classifies it into one of three categories:

  1. SQL-expressible — The transformation is a standard SQL operation (SELECT, JOIN, GROUP BY, CASE, window function, MERGE). Output: BigQuery pushdown SQL inside a Data Fusion pipeline stage.
  2. Plugin-required — The transformation involves row-level procedural logic, multi-branch routing, custom functions, or data type manipulation that SQL cannot express cleanly. Output: Native CDAP plugin (Wrangler, JavaScript Transform, Splitter, etc.).
  3. Hybrid — The transformation mixes SQL and procedural logic within a single construct (e.g., Informatica mapping with both SQL Override and Expression transformations). Output: Hybrid pipeline with both SQL pushdown and CDAP stages.

This classification is deterministic, auditable, and overridable. MigryX generates a conversion report that shows exactly why each transformation was routed to SQL or CDAP, with the original legacy code and the generated Data Fusion configuration side by side.
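
The full rule set is internal to MigryX, but the shape of that decision is easy to illustrate. The following is a deliberately simplified, hypothetical sketch of a deterministic routing rule; the real classifier works on the parsed AST of each transformation rather than on type names alone.

# Hypothetical, simplified sketch of SQL-vs-CDAP routing; not MigryX's actual rule set.
SQL_EXPRESSIBLE = {"source_qualifier", "aggregator", "joiner", "sorter", "union"}
PLUGIN_REQUIRED = {"router", "normalizer", "java_transform", "sequence_generator"}

def classify(transformation_type: str, has_procedural_logic: bool) -> str:
    """Route one legacy transformation to a target stage category."""
    if transformation_type in PLUGIN_REQUIRED:
        return "cdap_plugin"
    if transformation_type in SQL_EXPRESSIBLE:
        return "hybrid" if has_procedural_logic else "bigquery_pushdown"
    return "needs_review"  # surfaced in the conversion report for manual override

# Example: a Router with complex conditions becomes a CDAP Splitter stage,
# while a plain Source Qualifier is routed to BigQuery pushdown.
print(classify("router", has_procedural_logic=True))             # -> cdap_plugin
print(classify("source_qualifier", has_procedural_logic=False))  # -> bigquery_pushdown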

Key Takeaways

Migrating legacy ETL to Cloud Data Fusion is not a one-size-fits-all exercise. SQL-heavy workloads should push SQL to BigQuery for maximum performance. Visual dataflow logic should use CDAP's native plugin ecosystem. And most real-world migrations will produce hybrid pipelines that combine both approaches. MigryX automates this classification and generates production-ready Data Fusion pipeline JSON from every legacy source — Alteryx, Talend, DataStage, Informatica, SSIS, and ODI — with full lineage published to Dataplex.

Why MigryX Delivers Superior Migration Results

The challenges described throughout this article are exactly what MigryX was built to solve. Here is how MigryX transforms this process:

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to migrate legacy ETL to Cloud Data Fusion?

See how MigryX converts your Alteryx, Talend, DataStage, or Informatica workflows to production-ready Data Fusion pipelines with SQL pushdown and CDAP plugins.

Explore BigQuery Migration   Schedule a Demo