Data Integration

Talend’s contributions to Apache Beam


Talend Team


Apache Beam is an open-source, unified programming model for defining batch and streaming data processing pipelines, which simplifies large-scale data processing. The Beam model offers abstractions that insulate you from the low-level details of distributed processing, such as coordinating individual workers, reading from sources, and writing to sinks.

At Talend, we use Apache Beam in some of our products. For instance, Beam is the core component we use to run the pipelines in Talend Pipeline Designer. Talend also actively contributes to Apache Beam’s development, helping to shape the future of distributed data processing. Internally, we have a dedicated team of Talendians who fix bugs, add new features, and test new releases.

One of these Talendians is Alexey Romanenko, Principal Software Engineer, who has been part of the open-source development team for the past five years. He also participates regularly in open-source events, starting with the annual Beam Summit.

Alexey shares some of his ideas and expertise around Apache Beam in this collection of speaking sessions: 

Introduction to performance testing in Apache Beam [2022]

Benchmarking is an important but ambiguous part of performance testing for any software system. It is even more challenging for systems that support different language SDKs and distributed data processing runtimes, such as Apache Beam. 

In this session, Alexey presents the benchmark frameworks Apache Beam already supports (Nexmark, TPC-DS) and how they can be used for different purposes (like release testing) by developers and ordinary users. 

TPC-DS and Apache Beam — the time has come! [2021]

TPC-DS is the de facto SQL-based benchmark used to measure the performance of database systems and big data processing frameworks.

In this session, Alexey and Ismaël Mejia introduce TPC-DS and present the different ways of running the TPC-DS benchmark on Beam, via Beam SQL and via the “classical” Beam Java SDK. They also cover some issues related to Beam SQL, several performance optimizations, and the challenges of fair benchmarking on distributed processing systems.

Using Cross-language pipeline to run Python 3 code with Java SDK [2020]

The end of life of Python 2 made it more challenging to execute Python code in Java SDK pipelines — and vice versa — since not all old solutions still work well for Python 3. One of the potential solutions is to use a cross-language pipeline and portable runner in Apache Beam. 

In this session, Alexey defines cross-language pipelines in Beam and explains how to create and run a mixed Java/Python pipeline. The demo executes a user’s custom Python 3 code in the middle of a Java SDK pipeline and runs it with the portable Spark runner.
