
Introduction to Apache Spark: Open course
This three-day course is for data engineers, analysts, architects, software engineers, IT operations staff, and technical managers interested in the overall architecture and components of Apache Spark, in understanding Spark through exercises and use cases, and in Spark's interactions with different distributed storage systems (HDFS, NoSQL).
The course covers Apache Spark's main concepts: the core architecture, RDDs/DataFrames/Datasets, transformations and actions, the DAG, the SQL engine, the streaming engine, and the machine learning libraries. It also highlights possible uses of Spark in use cases such as ETL, analytics, and machine learning.
Day I – May 13: Spark Overview
- A brief history of Spark
- Where Spark fits in the big data landscape
- Apache Spark vs. Hadoop MapReduce: an overall architecture comparison
- Cluster Architecture: cluster manager, workers, executors; Spark Context; Cluster Manager Types; Deployment scenarios
- How Spark schedules and executes jobs and tasks
- Resilient Distributed Datasets: Fundamentals & hands-on exercises
- Ways to create an RDD: parallelize a collection; read from an external data source (local drive, HDFS, NoSQL); derive from an existing RDD
- Introduction to Transformations and Actions
- Caching
- RDD Types
- How transformations lazily build up a Directed Acyclic Graph (DAG); see the sketch after this list
- Shuffling
- Hands-on: using Spark for ETL
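
To give a flavor of the Day I topics, below is a minimal sketch of the RDD fundamentals: the three ways to create an RDD, lazy transformations extending the DAG, caching, and an action that triggers execution. It assumes a local Spark setup; the HDFS path is a placeholder, not part of the course material.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Way 1: parallelize a local collection
        val numbers = sc.parallelize(1 to 1000)

        // Way 2: read from an external data source (placeholder path)
        val lines = sc.textFile("hdfs:///data/events.log")

        // Way 3: derive a new RDD; transformations are lazy and only extend the DAG
        val errors = lines.filter(_.contains("ERROR")).map(_.toUpperCase)

        // cache() only marks the RDD; it is materialized by the first action
        errors.cache()

        // Nothing has executed so far; an action triggers the whole lineage
        println(s"errors: ${errors.count()}")

        sc.stop()
      }
    }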
Day II – May 14: Spark SQL & DataFrames/Datasets: Fundamentals and hands-on exercises
- What are DataFrames/Datasets vs. RDDs
- The DataFrames/Datasets API
- Catalyst Optimizer
- Spark SQL
- Creating and running DataFrame operations (see the sketch after this list)
- Reading from multiple data sources (hands-on exercises)
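
As a preview of the Day II material, here is a minimal sketch showing the same aggregation expressed once through the DataFrame API and once through Spark SQL; Catalyst optimizes both plans the same way. The CSV path and the column names (country, amount) are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DataFrameBasics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("df-basics")
          .master("local[*]")
          .getOrCreate()

        // Read from an external source (placeholder path and options)
        val sales = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/sales.csv")

        // DataFrame API: Catalyst optimizes this plan before execution
        sales
          .groupBy("country")
          .agg(sum("amount").as("total"))
          .orderBy(desc("total"))
          .show(10)

        // The same query through the SQL engine
        sales.createOrReplaceTempView("sales")
        spark.sql(
          "SELECT country, SUM(amount) AS total FROM sales GROUP BY country ORDER BY total DESC"
        ).show(10)

        spark.stop()
      }
    }

Both versions compile to the same optimized plan, which is why the choice between the two styles is largely a matter of ergonomics.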
Day III – May 15: Spark Streaming
- When to use Spark Structured Streaming
- Structured Streaming: building Spark streams out of Kafka topics; windowing & aggregation; registering a Spark DataFrame stream in memory and querying it with Spark SQL (see the first sketch after this list)
- Spark MLlib and spark.ml
- Machine learning examples: collaborative filtering with Alternating Least Squares (see the second sketch after this list); classification and regression
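
For orientation before the hands-on work, two minimal sketches follow. The first wires the Structured Streaming topics together: reading a Kafka topic, a windowed aggregation, and registering the stream as an in-memory table queryable with Spark SQL. It assumes the spark-sql-kafka connector is on the classpath; the broker address (localhost:9092), topic name (events), and window size are placeholder assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object KafkaStreamBasics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-stream")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Read a stream from a Kafka topic (broker and topic are placeholders)
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()

        // Kafka delivers binary key/value pairs; cast the value to a string
        val events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")

        // Windowed aggregation: count events per 5-minute window
        val counts = events
          .groupBy(window($"timestamp", "5 minutes"))
          .count()

        // Register the running aggregation as an in-memory table
        val query = counts.writeStream
          .outputMode("complete")
          .format("memory")
          .queryName("event_counts")
          .start()

        // The table fills as batches arrive and can be queried with Spark SQL
        spark.sql("SELECT * FROM event_counts ORDER BY window").show(truncate = false)

        query.awaitTermination()
      }
    }

The second sketch shows collaborative filtering with Alternating Least Squares from spark.ml; the ratings are toy data invented purely for illustration, and the hyperparameters (rank, iterations, regularization) are arbitrary.

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SparkSession

    object AlsBasics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("als-basics")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Toy ratings (userId, itemId, rating), invented for illustration
        val ratings = Seq(
          (0, 0, 4.0f), (0, 1, 2.0f),
          (1, 0, 5.0f), (1, 2, 3.0f),
          (2, 1, 1.0f), (2, 2, 5.0f)
        ).toDF("userId", "itemId", "rating")

        // Alternating Least Squares factorizes the user-item rating matrix
        val als = new ALS()
          .setUserCol("userId")
          .setItemCol("itemId")
          .setRatingCol("rating")
          .setRank(10)
          .setMaxIter(5)
          .setRegParam(0.1)

        val model = als.fit(ratings)

        // Recommend the top 2 items for every user
        model.recommendForAllUsers(2).show(truncate = false)

        spark.stop()
      }
    }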
An end-to-end Spark example: we will build a complete case, from data input through data cleaning and data storage to machine learning. We will work in a cloud environment and use Apache Zeppelin for all Spark coding/exercises (Scala).
Requirements: please have an unrestricted Internet connection (port 22 open) and Google Chrome available on your workstation. We also recommend having an SSH client available.
Prerequisites: all exercises will be done in Scala and SQL. Prior knowledge of Scala and SQL syntax could help in understanding the exercises, but please note that the main scope of the course is to understand the architecture, ways of working, and usage of Spark in different use cases. Scala programming is not part of the course objectives.
Trainer
VALENTINA CRISAN
Consultant & Trainer, Big Data Technologies
Consultant in the Big Data and Cloud domains (solutions architecture) and trainer for Big Data technologies and architectures (Apache Cassandra, Apache Kafka, the Hadoop ecosystem, and Big Data architectures), with more than 14 years of experience in the telecom and IT domains, architecting telecom value-added service solutions and leading technical presales teams for several years. Passionate about cloud and data, Valentina teaches Cassandra, Kafka, Hadoop, and Big Data architecture courses, works on consultancy projects in the Big Data domain, organizes the Bucharest Big Data meetup, and organizes and teaches several Bigdata.ro events.
Investment: 750 euro
For registration, fill in the form below: