
Title: Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark | Databricks

ABOUT THE TALK: Structured Streaming is the next generation of distributed stream processing in Apache Spark. Developers can write a query in their language of choice (Scala/Java/Python/R) using powerful high-level APIs (DataFrames / Datasets / SQL) and apply that same query to both static datasets and streaming data. In the streaming case, Spark automatically creates an incremental execution plan that handles late, out-of-order data and ensures end-to-end exactly-once fault-tolerance guarantees. In this practical session, I will walk through a concrete streaming ETL example where, in less than 10 lines, you can read raw, unstructured data from Kafka, transform it, and write it out as a structured table ready for batch and ad-hoc queries on up-to-the-last-minute data. I will also give a quick glimpse of advanced features like event-time based aggregations, stream-stream joins and arbitrary stateful operations.

ABOUT THE SPEAKER: Tathagata is a committer and PMC member of the Apache Spark project and a Software Engineer at Databricks. He is the lead developer of Spark Streaming, and now focuses primarily on Structured Streaming. Previously, he was a graduate student researcher in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.

ABOUT DATA COUNCIL: Data Council is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
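
The talk description mentions incremental execution and event-time aggregations that tolerate late, out-of-order data. The following is a conceptual sketch of that idea in plain Python. It is not Spark code and does not reproduce Spark's exact semantics. The class name `WindowedCounter`, the 60-second window and the 120-second lateness allowance are all assumptions chosen for illustration. Each call to `process_batch` folds one micro-batch into running state, which is roughly what an incremental plan does in place of recomputing over all data.

```python
from collections import defaultdict


class WindowedCounter:
    """Hypothetical helper: counts events per (window, key) by event time."""

    def __init__(self, window=60, delay=120):
        self.window = window            # tumbling window size in seconds (assumed)
        self.delay = delay              # allowed lateness in seconds (assumed)
        self.max_event_time = 0         # highest event time seen so far
        self.counts = defaultdict(int)  # open state: (window_start, key) -> count
        self.dropped = 0                # events that arrived behind the watermark

    def watermark(self):
        # Events for windows ending at or before this time are considered too late.
        return self.max_event_time - self.delay

    def process_batch(self, events):
        """Fold one micro-batch of (event_time, key) pairs into the state.

        Returns the windows finalized by this batch as {(window_start, key): count}.
        """
        # Lateness is judged against the watermark from earlier batches.
        watermark = self.watermark()
        for event_time, key in events:
            window_start = event_time - event_time % self.window
            if window_start + self.window <= watermark:
                self.dropped += 1       # window already closed: drop the event
                continue
            self.counts[(window_start, key)] += 1
            self.max_event_time = max(self.max_event_time, event_time)

        # Emit and forget every window that ended at or before the new watermark.
        new_watermark = self.watermark()
        finalized = {
            k: v for k, v in self.counts.items()
            if k[0] + self.window <= new_watermark
        }
        for k in finalized:
            del self.counts[k]
        return finalized


counter = WindowedCounter(window=60, delay=120)
out1 = counter.process_batch([(10, "a"), (70, "a"), (30, "a")])  # 30 is out of order
out2 = counter.process_batch([(200, "a"), (50, "a")])            # 50 is late but accepted
out3 = counter.process_batch([(20, "a"), (400, "a")])            # 20 is too late
print(out1, out2, out3, counter.dropped)
```

In this sketch the late event at time 50 is still counted in window 0 because that window has not yet fallen behind the watermark, while the event at time 20 arrives after the window has been finalized and is dropped. State for finalized windows is deleted, which is how a watermark keeps the state of a long-running aggregation bounded.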

