Hadoop for Developers
Course Curriculum
Module 1:
1. Introduction to Big Data and Hadoop
2. Components of Hadoop and Hadoop Architecture
3. HDFS, MapReduce & YARN Deep Dive
4. Installation & Configuration of Hadoop in a VM (Single Node)
5. Multinode Installation (3 Nodes)
a. On-Premise in Local Machines
b. Cloud
6. Performance Tuning, Advanced Administration Activities, and Monitoring the Hadoop Cluster
a. Hadoop Benchmarking (TeraGen & TeraSort on 10 GB of Data)
b. Hadoop Web UI Monitoring
c. Advanced Hadoop Administration Commands from the CLI
d. Tuning the Hadoop Cluster by Adjusting the Performance Parameters for HDFS & the MapReduce Framework
e. Node Commissioning (Adding) and Decommissioning (Removing)
f. Running the Balancer to Redistribute Data in Hadoop
7. Writing MapReduce programs in Java: Wordcount
a. Webserver Log Analysis
b. Recommendation Engine (Product Recommendation Generator)
c. Sentiment Analysis
d. Custom Record Readers, Partitioners, Combiners
e. Distributed Copy
8. Introduction to Pig and Pig Latin: Installation & Wordcount
a. Webserver Log Analysis
b. Sentiment Analysis
c. Processing JSON Data in Pig using the Elephant Bird Library
d. Advanced Pig Processing using the Piggybank Library
e. Building Pig UDFs and Calling Them from Pig Scripts
9. Advanced Pig Concepts
a. Performance Tuning Parameters
b. Controlling Parallelism
c. Running Pig Scripts on Tez
10. Introduction to Hive: Installation & Wordcount
a. Webserver Log Analysis
b. Product-Based Recommendation
c. Hive Performance Tuning Parameters
d. Loading CSV Data, JSON Data, etc. in Hive
e. Hive File Formats including Text, ORC, Parquet
11. Introduction to Sqoop
a. Advanced Sqoop Import-Export Options using Queries
b. Controlling Parallelism
12. Introduction to HBase: Installation and HBase Queries
13. ZooKeeper for Coordination, HBase Multinode Installation with ZooKeeper
14. Cloudera and Hortonworks Distributions of Hadoop
15. Deploying a Multinode Hadoop Cluster using Ambari
16. Workflow Scheduling using Oozie for Automation
Module 2:
- Other Components of the Hadoop ecosystem
- Flume for Realtime data collection
- Kafka for Realtime Log analysis: Log Filtering
- Spark for Realtime In-memory Analytics
- Advanced Spark Concepts, Spark Programming APIs, Spark RDDs
- Controlling Parallelism, Partitions & Persistence in Spark
- Spark SQL
- Spark Streaming
- Scala Programming Basics to Advanced
- Python Introduction & Python Spark programming using PySpark
- Spark for Realtime Log Analysis: Analytics
- Creating and Deploying End-to-End Web Log Analysis Solution
- Realtime Log collection using Flume
- Filtering the Logs in Kafka
- Realtime Threat detection in Spark using Logs from Kafka Stream
- Click Stream analysis using Spark
- Hadoop MR2 (YARN) Deployment and Integration with Spark
- Spark Machine Learning concepts and Lambda Architecture
- Machine Learning using MLlib
- Customer Churn Modeling using Spark MLlib
- Zeppelin for Data Visualization, Spark Programming in Zeppelin using IPython Notebooks
- Case studies & POC: run Hadoop on a medium-sized dataset (~5 GB of data). The POC can be based on a real-time project from your company or on one of Duratech's live projects.
Ready to get started?
Get in touch, or apply for a demo class.
Duratech Solutions
Duratech Solutions was incorporated in 2012 and has operated successfully in the global software development industry for 7 years.
We are the leaders in Coimbatore in Big Data and Data Science training, and the only training provider in Coimbatore offering Deep Learning, the highest level of Machine Learning & Artificial Intelligence technology. Our students have been placed at companies such as IBM, Sonata Software, and Deloitte.
Reach Us
320N, Arpee Complex, NSR Road, SaiBaba Colony, Coimbatore - 641 011, Tamil Nadu, India
256, 2nd Floor, Sathy Rd, DPK Complex, Sathy Main Road, Opp. to Perumal Kovil, Saravanampatti, Coimbatore, Tamil Nadu - 641035