{"project":{"acronym":"","projectId":18712,"title":"SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics for Scientific Data and Analysis","primaryTaxonomyNodes":[{"taxonomyNodeId":10833,"taxonomyRootId":8816,"parentNodeId":10831,"level":3,"code":"TX11.4.2","title":"Intelligent Data Understanding","definition":"Intelligent data understanding technologies provide the ability to automatically mine and analyze datasets that are large, noisy, and of varying modalities, including discrete, continuous, text, and graphics, and extract or discover information that can be used for further analysis or decision making.","exampleTechnologies":"Intelligent data collection and prioritization toolset, event detection and intelligent action toolset, data on demand toolset, intelligent data search and mining toolset, data fusion toolset, information representation standards for persistent data, artificial intelligence (AI), robot-automated cross-program standardization","hasChildren":false,"hasInteriorContent":true}],"startTrl":3,"currentTrl":3,"description":"We will construct SciSpark, a scalable system for interactive model evaluation and for the rapid development of climate metrics and analyses. SciSpark directly leverages the Apache Spark technology and its notion of Resilient Distributed Datasets (RDDs). RDDs represent an immutable data set that can be reused across multi-stage operations, partitioned across multiple machines and automatically reconstructed if a partition is lost. The RDD notion directly enables the reuse of array data across multi-stage operations and it ensures data can be replicated, distributed and easily reconstructed in different storage tiers, e.g., memory for fast interactivity, SSDs for near real time availability and I/O oriented spinning disk for later operations. RDDs also allow Spark's performance to degrade gracefully when there is not sufficient memory available to the system. 
It may seem surprising to consider an in-memory solution for massive datasets; however, a recent study found that at Facebook 96% of active jobs could have their entire data inputs in memory at the same time. In addition, it is worth noting that Spark has been shown to be 100x faster in memory and 10x faster on disk than Apache Hadoop, the de facto industry platform for Big Data. Hadoop scales well and there are emerging examples of its use in NASA climate projects (e.g., Teng et al. and Schnase et al.), but as is being discovered in these projects, Hadoop is best suited to batch processing and long-running operations. SciSpark contributes a Scientific RDD (sRDD) that corresponds to a multi-dimensional array representing a scientific measurement subset by space or by time. Scientific RDDs can be created in a handful of ways: (1) by directly loading HDF and NetCDF data into the Hadoop Distributed File System (HDFS); (2) by creating a partition or split function that divides up a multi-dimensional array by space or time; (3) by taking the results of a regridding operation or a climate metrics computation; or (4) by telling SciSpark to cache an existing sRDD, keeping it in memory for data reuse between stages. Scientific RDDs will form the basis for a variety of advanced and interactive climate analyses, starting by default in memory, and then being cached and replicated to disk when not directly needed. SciSpark will also use the Shark interactive SQL technology, which allows structured query language (SQL) to be used to store and retrieve RDDs, and will use Apache Mesos to be a good tenant in cloud environments, interoperating with other data system frameworks (e.g., HDFS, iRODS, SciDB, etc.). One of the key components of SciSpark is interactive sRDD visualization; to accomplish this, SciSpark delivers a user interface built around the Data Driven Documents (D3) framework. 
D3 is an immersive, JavaScript-based technology that exploits the underlying Document Object Model (DOM) structure of the web to create histograms, cartographic displays and inspections of climate variables and statistics. SciSpark is evaluated using several topical iterative scientific algorithms inspired by the NASA RCMES project, including machine-learning (ML) based clustering of temperature PDFs and other quantities over North America, and graph-based algorithms for searching for Mesoscale Convective Complexes in West Africa.","destinations":[{"lkuCodeId":1543,"code":"EARTH","description":"Earth","lkuCodeTypeId":526,"lkuCodeType":{"codeType":"DESTINATION_TYPE","description":"Destination Type"}}],"startYear":2015,"startMonth":3,"endYear":2017,"endMonth":2,"statusDescription":"Completed","principalInvestigators":[{"contactId":75315,"canUserEdit":false,"firstName":"Christian","lastName":"Mattmann","fullName":"Christian A Mattmann","fullNameInverted":"Mattmann, Christian A","middleInitial":"A","primaryEmail":"chris.a.mattmann@nasa.gov","publicEmail":true,"nacontact":false}],"programDirectors":[{"contactId":363458,"canUserEdit":false,"firstName":"Pamela","lastName":"Millar","fullName":"Pamela S Millar","fullNameInverted":"Millar, Pamela S","middleInitial":"S","primaryEmail":"pamela.s.millar@nasa.gov","publicEmail":true,"nacontact":false}],"programManagers":[{"contactId":190272,"canUserEdit":false,"firstName":"Jacqueline","lastName":"Le Moigne","fullName":"Jacqueline J Le Moigne","fullNameInverted":"Le Moigne, Jacqueline J","middleInitial":"J","primaryEmail":"Jacqueline.J.LeMoigne-Stewart@nasa.gov","publicEmail":true,"nacontact":false}],"coInvestigators":[{"contactId":501061,"canUserEdit":false,"firstName":"Yolanda","lastName":"Gil","fullName":"Yolanda Gil","fullNameInverted":"Gil, Yolanda","primaryEmail":"gil@isi.edu","publicEmail":false,"nacontact":false},{"contactId":275766,"canUserEdit":false,"firstName":"Kim","lastName":"Whitehall","fullName":"Kim D 
Whitehall","fullNameInverted":"Whitehall, Kim D","middleInitial":"D","primaryEmail":"kim.d.whitehall@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":183380,"canUserEdit":false,"firstName":"Hugo","lastName":"Lee","fullName":"Hugo K Lee","fullNameInverted":"Lee, Hugo K","middleInitial":"K","primaryEmail":"huikyo.lee@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":369382,"canUserEdit":false,"firstName":"Paul","lastName":"Loikith","fullName":"Paul C Loikith","fullNameInverted":"Loikith, Paul C","middleInitial":"C","primaryEmail":"ploikith@pdx.edu","publicEmail":false,"nacontact":false},{"contactId":130116,"canUserEdit":false,"firstName":"Duane","lastName":"Waliser","fullName":"Duane E Waliser","fullNameInverted":"Waliser, Duane E","middleInitial":"E","primaryEmail":"duane.waliser@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":259861,"canUserEdit":false,"firstName":"Karen","lastName":"Piggee","fullName":"Karen R Piggee","fullNameInverted":"Piggee, Karen R","middleInitial":"R","primaryEmail":"karen.r.piggee@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":54174,"canUserEdit":false,"firstName":"Brian","lastName":"Wilson","fullName":"Brian D Wilson","fullNameInverted":"Wilson, Brian D","middleInitial":"D","primaryEmail":"Brian.Wilson@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":292062,"canUserEdit":false,"firstName":"Lewis","lastName":"McGibbney","fullName":"Lewis J Mcgibbney","fullNameInverted":"McGibbney, Lewis J","middleInitial":"J","primaryEmail":"lewis.j.mcgibbney@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":142449,"canUserEdit":false,"firstName":"Eric","lastName":"Fetzer","fullName":"Eric J Fetzer","fullNameInverted":"Fetzer, Eric J","middleInitial":"J","primaryEmail":"eric.j.fetzer@jpl.nasa.gov","publicEmail":true,"nacontact":false},{"contactId":223462,"canUserEdit":false,"firstName":"Jinwon","lastName":"Kim","fullName":"Jinwon Kim","fullNameInverted":"Kim, 
Jinwon","primaryEmail":"jkim@atmos.ucla.edu","publicEmail":false,"nacontact":false}],"website":"","libraryItems":[],"transitions":[],"responsibleMd":{"acronym":"SMD","canUserEdit":false,"city":"","external":false,"linkCount":0,"organizationId":4909,"organizationName":"Science Mission Directorate","organizationType":"NASA_Mission_Directorate","naorganization":false,"organizationTypePretty":"NASA Mission Directorate"},"program":{"acronym":"AIST","active":true,"description":"
Advanced Information Systems Technology:
Facilitating the transformation of Earth observation concepts into data, information, and knowledge to benefit society
Information technology plays a critical role in collecting, managing and analyzing very large amounts of Earth observation data and information. ESTO’s Advanced Information Systems Technology (AIST) program serves the NASA research community by providing tools and techniques to acquire, process, access, visualize and otherwise communicate Earth science data.
Individual projects address the research community’s need for tools to simulate and develop sensor measurement concepts, as well as operations concepts and software systems to acquire and manage data for research and applications. The AIST program enables computer scientists to apply best practices from the rapidly evolving information technology fields to NASA’s unique interdisciplinary science challenges, to help the Earth science community to produce groundbreaking science and fully exploit the unique vantage point of space-based Earth observations.
","parentProgram":{"acronym":"ESD","active":true,"description":"ESTO's technology development approach is end-to-end: