The world is awash with digital data from social networks, blogs, business, science, and engineering. Data-intensive computing facilitates the understanding of complex problems whose solutions require processing massive amounts of data. Through the development of new classes of software, algorithms, and hardware, data-intensive applications can provide timely and meaningful analytical results in response to exponentially growing data complexity and the associated analysis requirements. This emerging area brings many challenges that differ from those of traditional high-performance computing. This reference for computing professionals and researchers describes the dimensions of the field, the key challenges, the state of the art, and the characteristics of likely approaches that future data-intensive problems will require. Chapters cover general principles and methods for designing such systems and for managing and analyzing the big data sets of today that live in the cloud, and describe example applications in bioinformatics and cybersecurity that illustrate these principles in practice.
Reviewer: Radu State
The recent shift from central processing unit (CPU)-intensive computation toward data-intensive computation has been supported by several research and innovation initiatives. These range from computational paradigms, like the popular MapReduce programming model and infrastructure, to advanced, customized, dedicated hardware, such as the use of graphics processing units (GPUs) to run general-purpose applications. The motivations for research on data-intensive computation are numerous. Big data will soon be everywhere: it comes from large physics experiments (such as the Large Hadron Collider experiments at CERN), popular crowdsourcing projects, social networks, and mobile applications, and it represents the outcomes of genetic and pharmacological research.

This book is a collection of ten chapters written by different authors, each addressing a specific area of data-intensive processing. The book can be read from cover to cover or on an individual chapter-by-chapter basis. The editors open the book with an introduction to data-intensive computation, followed by a second chapter on the anatomy and general processing flow graphs of data-intensive architectures. Hardware-specific architectures are addressed in the third chapter, whose main focus is the design and evaluation of hardware-assisted string matching on platforms such as Cray, Niagara2, and CUDA. Chapters 4 and 5 discuss related topics in data management. In the fourth chapter, the authors consider different types of data, as well as storage solutions that have emerged from the high-performance computing community. Chapter 5 presents a comprehensive view of the state of the art in existing distributed data-intensive architectures, covering NoSQL databases, distributed file systems (such as the Hadoop DFS), and general programming models (MapReduce), as well as their respective open-source implementations.

The next two chapters (6 and 7) explore common preprocessing and data analysis tasks. Dimension reduction, for example via principal component analysis (PCA), is shown to benefit from data-intensive computing, and the authors also consider the case of classification with support vector machines. Since the underlying computations are relatively intensive, it makes sense to leverage dedicated multicore systems to improve the performance of these processes. In chapter 8, the authors tackle a very interesting challenge in existing MapReduce architectures: when the distribution of the reduce tasks is skewed, the performance of the overall MapReduce process can be severely impacted. As a possible solution, the authors introduce an improved MapReduce-like framework called HaLoop, which supports iterative programming and leverages slave-node caching and indexing.

The last two chapters address domain-specific applications of data-intensive computation. Chapter 9 targets the biomedical domain and shows how genomic data (such as protein sequences) can be analyzed and visualized in real time to identify patterns and similarities among multispecies proteins. In chapter 10, the authors address another highly relevant area: monitoring network activity. The typical volume of data in network security monitoring can easily become the bottleneck for any anomaly detection method, and it raises significant challenges in terms of storage and processing facilities. The authors present the general principles by which such monitoring and visualization tools were implemented and deployed.
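To make the MapReduce discussion concrete, the following is a minimal, self-contained Python sketch of the map/shuffle/reduce flow and of the reducer-skew problem mentioned above. It is an illustration only, not code from the book: the function names (map_phase, shuffle, reduce_phase) and the toy corpus are my own assumptions, chosen to show how a dominant key concentrates work on a single reducer.

```python
# Hypothetical in-memory sketch of the MapReduce flow; not from the book.
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user-supplied map function and emit (key, value) pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user-supplied reduce function to each key group."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count over a toy corpus in which one word dominates; the reducer
# responsible for that word receives far more values than the others,
# which is the kind of skew that degrades overall job performance.
corpus = ["error error error error warning", "error error info"]
pairs = map_phase(corpus, lambda line: ((w, 1) for w in line.split()))
groups = shuffle(pairs)
counts = reduce_phase(groups, lambda key, values: sum(values))
print(counts)                                   # {'error': 6, 'warning': 1, 'info': 1}
print({k: len(v) for k, v in groups.items()})   # per-reducer load, skewed toward 'error'
```

In a distributed setting, the skewed key would pin most of the shuffle traffic and reduce work on one node, which is why frameworks in this space pay attention to partitioning, caching, and indexing strategies.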
Overall, I recommend this book for researchers and advanced graduate students. The collection of essays offers a rich and diverse overview of one of the most recent and fast-paced revolutions in computer science.