1 Data & Computational Resources

Explore the core infrastructure available to support your research—from managing large datasets to conducting computationally intensive analyses and sharing your results openly and reproducibly.

1.1 Data Storage & Management

Working groups are encouraged to use GitHub for version control and collaborative code development. However, GitHub has a 100 MB file size limit per file, making it unsuitable for storing large datasets. Depending on the size and format of your data, consider the following options:

  1. For small datasets (< 50 MB) that are published and accessible via a persistent web link (e.g., data from DataONE), reference the URL directly within your scripts to minimize redundancy and streamline reproducibility (see the first sketch after this list).

  2. For medium-sized datasets (larger than 50 MB but smaller than 100 GB) or unpublished data that need to be shared internally, use a shared Google Drive. Many working groups already have shared Drives; if yours has not been set up, please contact the Data Science Trainer. Organize raw data within the "data" folder in the shared Google Drive for consistency and ease of use (see the googledrive sketch after this list).

  3. For large datasets (> 100 GB), especially if you need to process the data using multiple cores, please reach out to the Data Science Trainer to coordinate access to NCEAS data servers. This ensures appropriate infrastructure for high-capacity and large-scale processing.
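
For example, a minimal R sketch of option 1, assuming a hypothetical persistent URL that points at a small CSV file:

    # Read a small published CSV directly from its persistent link
    # (the URL below is a placeholder; substitute your dataset's real link)
    data_url <- "https://cn.dataone.org/cn/v2/resolve/urn:uuid:EXAMPLE"
    obs <- read.csv(data_url)
    head(obs)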
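
And a sketch of option 2 using the googledrive R package to pull a file from a shared Drive; the file ID and local path below are placeholders:

    library(googledrive)

    drive_auth()  # authenticate with the account that can see the shared Drive

    # Download a file from the shared Drive into the local "data" folder
    # (copy the real file ID from the file's Drive URL)
    drive_download(file = as_id("1aBcDEFghIJ_exampleFileId"),
                   path = "data/raw_observations.csv",
                   overwrite = TRUE)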


1.2 High-Performance & Parallel Computing

Working groups at NCEAS can request access to high-performance computing (HPC) resources to support large-scale processing and computation.

  • To obtain HPC access with R/RStudio pre-installed, contact the Data Science Trainer to request access to the Aurora server.

  • If you’re working with a large dataset, using the Parquet file format can significantly improve read/write and computation speed. Parquet is a columnar storage format that allows for efficient compression and faster querying, especially when only a subset of columns is needed. Please see the training materials for how to read and write Parquet files; a short sketch using the arrow package follows this list.

  • For parallel computation, please see the tutorials for Python and R. These resources can help you scale your workflows efficiently across multiple cores or nodes; a minimal R example also follows this list.
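
As a quick illustration of the Parquet point above, a minimal sketch using the arrow R package; the file and column names are hypothetical:

    library(arrow)

    # One-time conversion: CSV to Parquet
    df <- read.csv("data/observations.csv")
    write_parquet(df, "data/observations.parquet")

    # Later reads can pull only the columns needed, which is far faster
    obs <- read_parquet("data/observations.parquet",
                        col_select = c("site_id", "date", "biomass"))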
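
And a minimal parallel-computation sketch using base R's parallel package; slow_model is a stand-in for your own per-unit analysis:

    library(parallel)

    # Stand-in for an expensive per-site computation
    slow_model <- function(site) {
      mean(rnorm(1e6, mean = site))
    }

    sites <- 1:8
    n_cores <- max(1, detectCores() - 1)  # leave one core free

    # mclapply() forks workers on Linux/macOS (e.g., a shared server);
    # on Windows, use makeCluster() and parLapply() instead
    results <- mclapply(sites, slow_model, mc.cores = n_cores)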


1.3 R Tips

Here are some optional R modules that fall outside our regular working group training topics. If you have questions about R programming, feel free to reach out—we can develop a custom module as part of our working group resources. These topics are designed to strengthen your data visualization and analysis skills using R:

  • Using sf for Spatial Data & Intro to Making Maps
    A hands-on tutorial introducing spatial data operations and basic mapping in R.
    📍 View the tutorial
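
For a taste of what the tutorial covers, here is a minimal sketch using the sf and ggplot2 packages with the North Carolina demo data that ships with sf:

    library(sf)
    library(ggplot2)

    # North Carolina counties data set bundled with sf
    nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

    # Simple choropleth: 1974 birth counts by county
    ggplot(nc) +
      geom_sf(aes(fill = BIR74)) +
      labs(fill = "Births (1974)") +
      theme_minimal()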

1.4 Data Sharing & Publishing

To support open and reproducible science, NCEAS encourages working groups to make their data and code Findable. See the relevant module developed by the NCEAS-LTER team.

  • Consider publishing your datasets in the KNB (Knowledge Network for Biocomplexity) for long-term preservation and reproducibility of your research.

  • Alternatively, publish datasets in trusted repositories such as the Environmental Data Initiative (EDI) using tools like ezEML for metadata creation.

  • Code should be versioned on GitHub and published through Zenodo to obtain a DOI, following the GitHub-to-Zenodo steps.

  • The DataONE portal service offers an easy, sustainable way for working groups to showcase and share their datasets. Portals can include searchable data catalogs, embedded maps, visualizations, and Shiny apps, all without needing to maintain a separate website. Data can come from repositories like KNB, EDI, or others in the DataONE network. The service is currently free and ideal for long-term access and visibility of your project’s outputs.

1.5 Other NCEAS Resources