Introduction to Apache Spark

Valentina Crisan

This course is taught in Romanian; the materials are in English and/or Romanian, as appropriate.

On request, the course can be customized.

The course covers Apache Spark's main concepts: the core (architecture, RDDs/DataFrames/Datasets, transformations & actions, the DAG), the SQL engine, the streaming engine, and the machine learning libraries. It also highlights possible uses of Spark in different use cases such as ETL, analytics, and machine learning.

An end-to-end Spark example – We will build an end-to-end case, from data ingestion through data cleaning and storage to machine learning. We will work in a cloud environment and use Apache Zeppelin for all the Spark coding/exercises (Scala).

Course details

  • Chapters: 24
  • Duration: 3 days
  • Knowledge level: any level
  • Language: Romanian
  • Participants: 12
  • Day 1: Spark Overview

    • Chapter 1.1 A brief history of Spark
    • Chapter 1.2 Where Spark fits in the big data landscape
    • Chapter 1.3 Apache Spark vs. Apache MapReduce: an overall architecture comparison
    • Chapter 1.4 Cluster architecture: cluster manager, workers, executors; Spark Context; cluster manager types; deployment scenarios
    • Chapter 1.5 How Spark schedules and executes jobs and tasks
    • Chapter 1.6 Resilient Distributed Datasets: fundamentals & hands-on exercises
    • Chapter 1.7 Ways to create an RDD: parallelize a collection; read from an external data source; from an existing RDD
    • Chapter 1.8 Introduction to transformations and actions
    • Chapter 1.9 Caching
    • Chapter 1.10 RDD types
    • Chapter 1.11 How transformations lazily build up a Directed Acyclic Graph (DAG)
    • Chapter 1.12 Shuffling
    • Chapter 1.13 Hands-on: using Spark for ETL
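    The Day 1 chapters center on the split between lazy transformations (which only build up the DAG) and actions (which trigger execution). A minimal local sketch of that idea, in plain Scala so it runs without a cluster: a lazy `view` stands in for an RDD's deferred transformations, and forcing it stands in for an action such as `collect()`. All names here are illustrative, not taken from the course materials.

    ```scala
    object RddStyleDemo {
      // "Transformations" are lazy: on an RDD, filter and map only record work
      // in the DAG. A Scala view is also lazy, so nothing runs until an
      // action-like call (here, toList) forces evaluation.
      def errorLines(lines: List[String]): List[String] =
        lines.view
          .filter(_.startsWith("error")) // like rdd.filter(...)
          .map(_.toUpperCase)            // like rdd.map(...)
          .toList                        // the "action" that forces the work

      def main(args: Array[String]): Unit = {
        val lines = List("error: disk full", "info: ok", "error: timeout")
        println(errorLines(lines)) // List(ERROR: DISK FULL, ERROR: TIMEOUT)
      }
    }
    ```

    On a real RDD the same pipeline would be distributed across executors, and `cache()` (Chapter 1.9) would let an intermediate result be reused without recomputing the DAG.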
  • Day 2: Spark SQL and DataFrames/Datasets: fundamentals and hands-on exercises

    • Chapter 2.1 What are DataFrames/Datasets vs. RDDs
    • Chapter 2.2 The DataFrames/Datasets API
    • Chapter 2.3 Spark SQL
    • Chapter 2.4 Creating and running DataFrame operations
    • Chapter 2.5 Reading from multiple data sources and hands-on exercises: HDFS; NoSQL; Hive
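    The DataFrame API covered on Day 2 is essentially relational: select, filter, group, aggregate. Sketched on a local Scala collection so it runs without Spark (the data and names are invented for illustration); in the course exercises the same shape would appear as something like `df.filter($"age" > 30).groupBy($"city").count()` on a Spark DataFrame.

    ```scala
    // A typed row, playing the role a Dataset's case class plays in Spark.
    final case class Person(name: String, age: Int, city: String)

    object DataFrameStyleDemo {
      // filter + groupBy + count: the relational core of the DataFrame API.
      def countOver30ByCity(people: List[Person]): Map[String, Int] =
        people.filter(_.age > 30)
          .groupBy(_.city)
          .map { case (city, ps) => city -> ps.size }

      def main(args: Array[String]): Unit = {
        val people = List(
          Person("Ana", 34, "Bucharest"),
          Person("Ion", 28, "Cluj"),
          Person("Maria", 41, "Bucharest")
        )
        println(countOver30ByCity(people)) // Map(Bucharest -> 2)
      }
    }
    ```

    The practical difference in Spark is that the query runs through the SQL engine's optimizer and executes distributed, whatever the source (HDFS, a NoSQL store, or Hive).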
  • Day 3: Spark Streaming

    • Chapter 3.1 When to use Spark Streaming
    • Chapter 3.2 DStreams
    • Chapter 3.3 Structured Streaming: building Spark streams out of Kafka topics; windowing & aggregation; registering a Spark DataFrame stream
    • Chapter 3.4 Spark's MLlib and the MLlib Pipeline API for machine learning
    • Chapter 3.5 Spark MLlib and
    • Chapter 3.6 Machine learning examples: collaborative filtering (Alternating Least Squares); classification and regression
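    Chapter 3.3's windowing & aggregation means computing an aggregate over groups of consecutive events rather than the whole stream. A local sketch of the idea with Scala's `sliding` on a finite sequence (the data is invented for illustration); in the Structured Streaming exercises the equivalent would be a windowed `groupBy` over a Kafka-backed streaming DataFrame.

    ```scala
    object WindowDemo {
      // Sum over sliding windows of `size` consecutive events: each window
      // advances by one event, like an overlapping time window in streaming.
      def slidingSums(events: Vector[Int], size: Int): List[Int] =
        events.sliding(size).map(_.sum).toList

      def main(args: Array[String]): Unit = {
        val perSecondCounts = Vector(3, 5, 2, 8, 6)
        println(slidingSums(perSecondCounts, 3)) // List(10, 15, 16)
      }
    }
    ```

    In a real stream the windows are defined over event time and the aggregate is updated incrementally as new data arrives, rather than recomputed over a finished sequence.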