The scope of the course “Big Data (Cloudera) Architecture” is to provide an understanding of the Apache Hadoop architecture and its different commercial distributions (with a special emphasis on Cloudera and MapR). This course is intended for people working at an architectural level.
At the end of the course, the participant will be able to:
- understand the Hadoop ecosystem and the applicability and scope of its different components
- compare two of its commercial distributions: Cloudera and MapR.
The course will include real examples of Cloudera and MapR installations and the architectural considerations that led to each choice.
Prerequisites: knowledge of SQL/NoSQL differences, distributed systems, and networking concepts.
Course structure and duration:
Three intensive days, with the following structure:
Day I – May 8: Introduction to Big Data & the Hadoop architecture/ecosystem
- Big Data: a bit of history and the V’s
- Some well-known and some less well-known Big Data use cases
- Lambda architecture overview – data formats – real-time and near-real-time processing, batch processing – machine learning
- What Hadoop is, the emergence of HDFS, and the evolution of MapReduce
- Use cases of Hadoop
- Hadoop architecture: detailed overview and applicability cases – HDFS & MapReduce essentials, YARN
- Data modelling with Apache Avro, Parquet + Hue/Zeppelin (hands-on exercises)
Day II – May 9: Other projects that are most often part of the Hadoop ecosystem
- Data ingestion: Sqoop, Flume
- Data storage: Cassandra, Kudu, HBase, Parquet, Redis
- Data computing:
– Distributed computing: Pig, Storm, Spark, Flink
- Data analysis:
– SQL on Hadoop: Hive, Impala
– Search: Solr/Elastic
– Spark SQL
- Machine Learning: Spark
- Other: Oozie, Zookeeper, Hue
- Building 2 possible architectures for streaming data and batch data, including ML: hands-on exercises
Day III – May 10: Commercial distributions of Hadoop
- Cloudera:
– Architecture & components
– Cloudera-specific tools & functions
– 2 real use cases presented
- MapR:
– Architecture & components
– MapR-specific features
– 2 real use cases
- Comparison of Cloudera and MapR
Technical requirements:
- An open Internet connection will be needed throughout the course.
- Each participant needs his or her own computer in order to run the hands-on exercises; the computer settings must allow access to Google Docs and GitHub, where the presenters’ slides, documents, data, and exercises are shared.
- For the local computers that will run Cloudera CDH (we will have a multi-node CDH deployment via Docker):
– laptop/desktop with a minimum of 12GB RAM (16GB RAM recommended – see below); per the note below, if we use a remote server instead, the local laptops/desktops need only 4GB RAM minimum;
– Ubuntu 14.04 or 16.04, 64 bit
– a valid Docker 1.11.0 or newer installation, running on a 64-bit system (either directly on a Mac or Linux machine, or in a VirtualBox – or similar – VM running a 64-bit guest; running Docker inside a VM this way is fine for testing and learning purposes);
– Cloudera clusterdock (Cloudera’s framework for creating Docker-based container clusters) requires 16GB of RAM for a two-node cluster, but we can start the cluster without some of the services (or ignore the overcommit warnings) and get away with 8–12GB RAM. We recommend 12GB RAM for this course.
* Note: if the 12GB RAM per computer requirement cannot be met, we can work as well with a remote server (for every 3 students) allocated from the client’s internal infrastructure and made fully accessible throughout the course.
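As a quick sanity check before the course, the requirements above can be verified with a short script. This is only a sketch for Linux machines (it reads /proc/meminfo, matching the Ubuntu requirement above); the thresholds mirror the numbers listed in this section, and the Docker check simply reports what it finds:

```shell
#!/bin/sh
# Pre-course environment check - a sketch based on the requirements above.

# 64-bit system? (expect x86_64)
echo "Architecture: $(uname -m)"

# Docker installed? (1.11.0 or newer is required)
if command -v docker >/dev/null 2>&1; then
  echo "Docker: $(docker --version)"
else
  echo "Docker: not found - please install Docker 1.11.0 or newer"
fi

# Enough RAM? (12GB recommended; 8GB may work with reduced services)
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "RAM: $((mem_kb / 1024 / 1024)) GB"
```

Running this before Day I saves setup time; participants whose machines fall short of the RAM figure should contact us about the remote-server option in the note above.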
Consultant & Trainer Big Data Technologies
Consultant in the Big Data and Cloud domains (solutions architecture) and trainer for Big Data technologies and architectures (Apache Cassandra, Apache Kafka, the Hadoop ecosystem, and Big Data architectures), with more than 14 years’ experience in the telecom and IT domains, architecting telecom value-added service solutions and leading technical presales teams for several years. Passionate about cloud and data, Valentina teaches Cassandra, Kafka, Hadoop, and Big Data architecture courses, works on consultancy projects in the Big Data domain, organizes the Bucharest Big Data meetup, and organizes and teaches several Bigdata.ro events.
Investment: 750 euro
For registration, fill in the form below: