Author(s): Bengfort, B.; Kim, J. | Publisher: O’Reilly | Year: 2016 | Language: English | Pages: 288 | Size: 7 MB | Extension: pdf
Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Rather than covering the deployment, operations, or software development usually associated with distributed computing, you'll focus on the particular analyses you can build, the data warehousing techniques Hadoop provides, and the higher-order data workflows this framework can produce.
Data scientists and analysts will learn how to apply a wide range of techniques, from writing MapReduce and Spark applications with Python to advanced modeling and data management with Spark MLlib, Hive, and HBase. You'll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.
- Understand core concepts behind Hadoop and cluster computing
- Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
- Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
- Use Sqoop and Apache Flume to ingest data from relational databases
- Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
- Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark's MLlib
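As a taste of the MapReduce-with-Python approach the book covers, here is a minimal sketch of the classic word-count job in the Hadoop Streaming style: a mapper that emits (key, value) pairs and a reducer that aggregates them by key. The function names `mapper` and `reducer` and the inline sample data are illustrative assumptions, not code from the book; on a real cluster, Hadoop's shuffle phase would handle the sorting and grouping between the two stages.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)


def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Sorting here simulates Hadoop's shuffle, which delivers pairs to
    the reducer grouped by key.
    """
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


if __name__ == "__main__":
    # Toy in-memory input standing in for lines read from HDFS.
    lines = ["big data big ideas", "data pipelines"]
    counts = dict(reducer(mapper(lines)))
    print(counts)
```

In a real Hadoop Streaming job, the mapper and reducer would run as separate scripts reading from stdin and writing tab-separated pairs to stdout, with the framework distributing the work across the cluster.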
Table of contents:
Part I: Introduction to Distributed Computing (p. 1)
Part II: Workflows and Tools for Big Data Science (p. 129)
Appendix A: Creating a Hadoop Pseudo-Distributed Development Environment (p. 227)
Appendix B: Installing Hadoop Ecosystem Products (p. 237)