1 Data & Computational Resources
Explore the core infrastructure available to support your research—from managing large datasets to conducting computationally intensive analyses and sharing your results openly and reproducibly.
1.1 Data Storage & Management
Working groups are encouraged to use GitHub for version control and collaborative code development. However, GitHub limits individual files to 100 MB, making it unsuitable for storing large datasets. Depending on the size and format of your data, consider the following options:
For small datasets (< 50 MB) that are published and accessible via a persistent web link (e.g., data from DataONE), you are encouraged to reference the URL directly within your scripts to minimize redundancy and streamline reproducibility.
For medium-sized datasets (larger than 50 MB but smaller than 100 GB) or unpublished data that need to be shared internally, use a shared Google Drive. Many working groups already have shared Drives; if yours has not been set up, please contact the Data Science Trainer. Organize raw data within the "data" folder in the shared Google Drive for consistency and ease of use.
For large datasets (> 100 GB), especially if you need to process the data using multiple cores, please reach out to the Data Science Trainer to coordinate access to NCEAS data servers. This ensures appropriate infrastructure for high-capacity storage and large-scale processing.
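The size thresholds above can be sketched as a small Python helper, together with a way to read a small published file directly from its URL. This is only an illustration and not an NCEAS tool: the function names recommend_storage and read_csv_from_url are hypothetical, the use of decimal units (1 MB = 10^6 bytes) is an assumption, and so is the treatment of datasets that are exactly 50 MB or exactly 100 GB (both fall to the shared Google Drive here).

```python
import csv
import io
from urllib.request import urlopen

# Assumption: decimal units; the guidance above does not say which convention applies.
MB = 1000 ** 2
GB = 1000 ** 3


def recommend_storage(size_bytes, published_with_persistent_url=False):
    """Map a dataset size to one of the three storage options described above."""
    if size_bytes > 100 * GB:
        return "NCEAS data server (contact the Data Science Trainer)"
    if size_bytes < 50 * MB and published_with_persistent_url:
        return "reference the persistent URL in scripts"
    # Medium-sized data, and any unpublished data under 100 GB.
    return "shared Google Drive"


def read_csv_from_url(url):
    """Read a small published CSV straight from its URL into a list of dicts."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Usage (placeholder URL, shown as a comment so nothing is downloaded here):
# rows = read_csv_from_url("https://example.org/path/to/published_dataset.csv")
```

Reading from the URL keeps a single authoritative copy of the data, so collaborators who run the script always get the same published version.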
1.2 High-Performance & Parallel Computing
Working groups at NCEAS can request access to high-performance computing (HPC) resources to support large-scale processing and computation.
For HPC access with R/RStudio pre-installed, contact the Data Science Trainer to request access to the Aurora server.
If you’re working with a large dataset, using the Parquet file format can significantly improve read/write and computation speed. Parquet is a columnar storage format that allows for efficient compression and faster querying, especially when only a subset of columns is needed. Please see the training materials for how to read and write Parquet files.
For parallel computation, please see the tutorials for Python and R. These resources can help you scale your workflows efficiently across multiple cores or nodes.
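For a sense of what multi-core processing can look like, here is a minimal Python sketch using only the standard library. It is not taken from the tutorials: the function names are hypothetical, sum_of_squares is a stand-in for a real per-chunk analysis, and the sketch covers multiple cores on one machine only, not multiple nodes.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Stand-in for an expensive computation on one chunk of data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_workers=None, executor_cls=ProcessPoolExecutor):
    """Process chunks concurrently, then combine the partial results."""
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_into_chunks(list(values), n_workers)
    with executor_cls(max_workers=n_workers) as executor:
        partial_sums = list(executor.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

When using the default process pool from a script, call parallel_sum_of_squares inside an `if __name__ == "__main__":` guard so that worker processes do not re-run the script's top-level code. The executor_cls parameter is a design choice in this sketch: passing ThreadPoolExecutor instead suits I/O-bound work such as downloading files.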
1.3 R Tips
Here are some optional R modules that fall outside our regular working group training topics. If you have questions about R programming, feel free to reach out—we can develop a custom module as part of our working group resources. These topics are designed to strengthen your data visualization and analysis skills using R:
- Using sf for Spatial Data & Intro to Making Maps
A hands-on tutorial introducing spatial data operations and basic mapping in R.
📍 View the tutorial
1.4 Data Sharing & Publishing
To support open and reproducible science, NCEAS encourages working groups to make their data and code Findable. See the relevant module developed by the NCEAS-LTER team.
Consider publishing your datasets in KNB for long-term preservation and reproducibility of your research.
Alternatively, publish datasets in trusted repositories such as the Environmental Data Initiative (EDI) using tools like ezEML for metadata creation.
Code should be versioned in GitHub and published through Zenodo to obtain a DOI, following the GitHub –> Zenodo steps.
The DataONE portal service offers an easy, sustainable way for working groups to showcase and share their datasets. Portals can include searchable data catalogs, embedded maps, visualizations, and Shiny apps, all without needing to maintain a separate website. Data can come from repositories like KNB, EDI, or others in the DataONE network. The service is currently free and ideal for long-term access and visibility of your project’s outputs.
1.5 Other NCEAS Resources
Synthesis Skills for Early Career Researchers (SSECR)
An LTER course designed to build foundational synthesis and collaboration skills.
LTER Scientific Computing Team Website
Resources and tools from the LTER community focused on scientific computing best practices.
NCEAS Resources for Working Groups
A curated collection of guidance and tools drawn from NCEAS’s extensive experience supporting synthesis science.
NCEAS High Performance Computing
An overview of the high-performance computing available to NCEAS working groups.
Carpentry @ UCSB library
A list of the free training provided by the UCSB library, including various R and Python courses for the coming quarter.