NREL National Solar Radiation Database (NSRDB) - HSDS Demo

 

This notebook demonstrates basic usage of the National Renewable Energy Laboratory (NREL) National Solar Radiation Database (NSRDB) data. The data is provided from Amazon Web Services using the HDF Group's Highly Scalable Data Service (HSDS).

 

For this to work you must first install h5pyd:

pip install --user git+https://github.com/HDFGroup/h5pyd.git

Then, you can make a configuration file at ~/.hscfg with contents like so:

# HDFCloud configuration file
hs_endpoint = https://developer.nrel.gov/api/hsds/
hs_username = None
hs_password = None
hs_api_key = 3K3JQbjZmWctY0xmIfSYvYgtIcM3CN0cb1Y2w9bf

The example API key here is for demonstration purposes and is rate-limited per IP. To get your own API key, visit https://developer.nrel.gov/signup/

In [99]:
%matplotlib inline
import h5pyd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from scipy.spatial import cKDTree
 

Basic Usage

The NSRDB is provided in annual .h5 files and currently spans 1998-2017.
Each year can be accessed from /nrel/nsrdb/nsrdb_${year}.h5
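Since the annual files follow a single naming pattern, paths for other years can be generated programmatically (a local sketch; each path would then be opened with h5pyd.File as in the next cell):

```python
# Annual NSRDB files share one naming pattern, so a year range maps to paths:
years = range(1998, 2018)  # 1998-2017 inclusive
paths = [f"/nrel/nsrdb/nsrdb_{year}.h5" for year in years]
print(paths[0])   # /nrel/nsrdb/nsrdb_1998.h5
print(paths[-1])  # /nrel/nsrdb/nsrdb_2017.h5
```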

In [3]:
# Open the desired year of nsrdb data
# server endpoint, username, password is found via a config file
f = h5pyd.File("/nrel/nsrdb/nsrdb_2012.h5", 'r')
In [4]:
list(f.attrs)  # list attributes belonging to the root group
Out[4]:
['version']
In [5]:
f.attrs['version']   # attributes can be used to provide descriptions of the content
Out[5]:
'v3.0.1'
 

Datasets

In [6]:
list(f)  # list the datasets in the file
Out[6]:
['air_temperature',
 'clearsky_dhi',
 'clearsky_dni',
 'clearsky_ghi',
 'cloud_type',
 'coordinates',
 'dew_point',
 'dhi',
 'dni',
 'fill_flag',
 'ghi',
 'meta',
 'relative_humidity',
 'solar_zenith_angle',
 'surface_albedo',
 'surface_pressure',
 'time_index',
 'total_precipitable_water',
 'wind_direction',
 'wind_speed']
In [7]:
# Datasets are stored in a 2d array of time x location
dset = f['ghi']
dset.shape
Out[7]:
(17568, 2018392)
In [18]:
# Extract datetime index for datasets
time_index = pd.to_datetime(f['time_index'][...].astype(str))
time_index # Temporal resolution is 30min
Out[18]:
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:30:00',
               '2012-01-01 01:00:00', '2012-01-01 01:30:00',
               '2012-01-01 02:00:00', '2012-01-01 02:30:00',
               '2012-01-01 03:00:00', '2012-01-01 03:30:00',
               '2012-01-01 04:00:00', '2012-01-01 04:30:00',
               ...
               '2012-12-31 19:00:00', '2012-12-31 19:30:00',
               '2012-12-31 20:00:00', '2012-12-31 20:30:00',
               '2012-12-31 21:00:00', '2012-12-31 21:30:00',
               '2012-12-31 22:00:00', '2012-12-31 22:30:00',
               '2012-12-31 23:00:00', '2012-12-31 23:30:00'],
              dtype='datetime64[ns]', length=17568, freq=None)
In [19]:
# Locational information is stored in either 'meta' or 'coordinates'
meta = pd.DataFrame(f['meta'][...])
meta.head()
Out[19]:
   latitude  longitude  elevation  timezone  country  state  county  urban  population  landcover
0 -19.99 -175.259995 0.0 13 b'None' b'None' b'None' b'None' -9999 210
1 -19.99 -175.220001 0.0 13 b'None' b'None' b'None' b'None' -9999 210
2 -19.99 -175.179993 0.0 13 b'None' b'None' b'None' b'None' -9999 210
3 -19.99 -175.139999 0.0 13 b'None' b'None' b'None' b'None' -9999 210
4 -19.99 -175.100006 0.0 13 b'None' b'None' b'None' b'None' -9999 210
In [20]:
# Datasets have been saved as integers
dset.dtype
Out[20]:
dtype('int16')
In [21]:
dset.shape[0] * dset.shape[1] * 2 * 10**-9 # 70 GB per dataset!
Out[21]:
70.918221312
In [23]:
dset.chunks # Chunked by week
Out[23]:
(336, 2976)
In [24]:
dset.chunks[0] * dset.chunks[1] * 2 * 10**-6 # 2 MB per chunk
Out[24]:
1.9998719999999999
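The shape and chunk figures above can be sanity-checked with plain arithmetic, no HSDS calls needed:

```python
import math

shape = (17568, 2018392)   # time x location, from dset.shape above
chunks = (336, 2976)       # from dset.chunks above
itemsize = 2               # int16

# One chunk holds a week (336 half-hours) of data for 2976 sites:
chunk_bytes = chunks[0] * chunks[1] * itemsize
print(chunk_bytes)         # 1999872 bytes, i.e. ~2 MB

# Chunks per axis (the last chunk along each axis is partial):
n_chunks = (math.ceil(shape[0] / chunks[0]), math.ceil(shape[1] / chunks[1]))
print(n_chunks)            # (53, 679)
```

Requests that align with chunk boundaries (e.g. whole weeks of time) minimize the number of chunks the server must touch.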
In [27]:
# To convert dataset values back to floats, use the 'psm_scale_factor'
dset.attrs['psm_scale_factor'] # Irradiance values have been truncated to integer precision
Out[27]:
1
In [32]:
# Wind speed, on the other hand, has single-decimal precision when scaled by 10
scale_factor = f['wind_speed'].attrs['psm_scale_factor']
units = f['wind_speed'].attrs['psm_units']
print('wind_speed scale factor = ', scale_factor)
print('wind_speed units after unscaling = ', units)
f['wind_speed'][0, 0] / scale_factor # divide by scale_factor to return native value
 
wind_speed scale factor =  10.0
wind_speed units after unscaling =  degrees
Out[32]:
3.9
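A small helper makes the unscaling pattern reusable. `unscale` is a hypothetical name, and the stored values below are simulated rather than fetched from the server:

```python
import numpy as np

def unscale(values, scale_factor):
    # Divide stored integers by the dataset's 'psm_scale_factor'
    # to recover values in physical units.
    return np.asarray(values) / scale_factor

# Simulated int16 wind_speed values with the scale factor of 10 shown above:
stored = np.array([39, 125, 7], dtype=np.int16)
print(unscale(stored, 10.0))   # [ 3.9  12.5   0.7]
```

With a live connection this would be called as `unscale(f['wind_speed'][0, :10], f['wind_speed'].attrs['psm_scale_factor'])`.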
 

Time-slicing

 

Get the time_index from the server and convert to a pandas DatetimeIndex for convenience:

In [34]:
time_index = pd.to_datetime(f['time_index'][...].astype(str))
time_index
Out[34]:
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 00:30:00',
               '2012-01-01 01:00:00', '2012-01-01 01:30:00',
               '2012-01-01 02:00:00', '2012-01-01 02:30:00',
               '2012-01-01 03:00:00', '2012-01-01 03:30:00',
               '2012-01-01 04:00:00', '2012-01-01 04:30:00',
               ...
               '2012-12-31 19:00:00', '2012-12-31 19:30:00',
               '2012-12-31 20:00:00', '2012-12-31 20:30:00',
               '2012-12-31 21:00:00', '2012-12-31 21:30:00',
               '2012-12-31 22:00:00', '2012-12-31 22:30:00',
               '2012-12-31 23:00:00', '2012-12-31 23:30:00'],
              dtype='datetime64[ns]', length=17568, freq=None)
 

Extract indexes for a particular span of time:

In [37]:
march = time_index.month == 3
np.where(march)[0]
Out[37]:
array([2880, 2881, 2882, ..., 4365, 4366, 4367])
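Because the March timesteps are contiguous, a slice is more efficient than fancy indexing for the HSDS read. The bounds can be computed from a locally rebuilt index (a sketch; the commented line at the end shows the server-side read):

```python
import numpy as np
import pandas as pd

# Rebuild the 30-minute 2012 index locally (it matches the server's time_index):
time_index = pd.date_range('2012-01-01', '2012-12-31 23:30', freq='30min')

march = np.where(time_index.month == 3)[0]
start, stop = march[0], march[-1] + 1
print(start, stop)   # 2880 4368

# On the server, one contiguous read per site:
# ghi_march = dset[start:stop, site_idx]
```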
 

Or a particular date:

In [65]:
timestep = np.where(time_index == '2012-07-04 00:00:00')[0][0]
timestep
Out[65]:
8880
 

Map Data

In [66]:
# Extract coordinates (lat, lon)
print(dict(f['coordinates'].attrs))
coords = f['coordinates'][...]
 
{'description': '(latitude, longitude)'}
In [93]:
dset = f['ghi']
%time data = dset[timestep, ::10]   # extract every 10th location at a particular time
df = pd.DataFrame() # Combine data with coordinates in a DataFrame
df['longitude'] = coords[::10, 1]
df['latitude'] = coords[::10, 0]
df['ghi'] = data / dset.attrs['psm_scale_factor'] # unscale dataset
 
CPU times: user 5.54 ms, sys: 3.71 ms, total: 9.25 ms
Wall time: 2.66 s
In [94]:
df.shape
Out[94]:
(201840, 3)
In [95]:
df.plot.scatter(x='longitude', y='latitude', c='ghi',
                colormap='YlOrRd',
                title=str(time_index[timestep]))
plt.show()
 
In [96]:
# Full-resolution subset of California
meta = pd.DataFrame(f['meta'][...])
CA = meta.loc[meta['state'] == b'California'] # Note: .h5 stores strings as byte strings
CA.head()
Out[96]:
       latitude  longitude  elevation  timezone  country  state  county  urban  population  landcover
70276 32.529999 -117.099998 55.062500 -8 b'United States' b'California' b'San Diego' b'None' 32326 130
70588 32.570000 -117.099998 7.100000 -8 b'United States' b'California' b'San Diego' b'Tijuana' 27971 190
70589 32.570000 -117.059998 24.920000 -8 b'United States' b'California' b'San Diego' b'Tijuana' 51608 190
70590 32.570000 -117.019997 96.599998 -8 b'United States' b'California' b'San Diego' b'Tijuana' 15236 110
70591 32.570000 -116.980003 140.600006 -8 b'United States' b'California' b'San Diego' b'Tijuana' 2949 130
In [97]:
%time data = dset[timestep][CA.index]  # full-resolution subset
df = CA[['longitude', 'latitude']].copy()
df['ghi'] = data / dset.attrs['psm_scale_factor']
df.shape
 
CPU times: user 31.4 ms, sys: 27.3 ms, total: 58.8 ms
Wall time: 3.96 s
Out[97]:
(26010, 3)
In [98]:
df.plot.scatter(x='longitude', y='latitude', c='ghi',
                colormap='YlOrRd',
                title=str(time_index[timestep]))
plt.show()
 
 

Nearest Timeseries for given Lat/Lon

In [103]:
# Unlike the gridded WTK data, the NSRDB is provided as a sparse time-series dataset.
# The quickest way to find the nearest site is to use a KDTree

dset_coords = f['coordinates'][...]
tree = cKDTree(dset_coords)
def nearest_site(tree, lat_coord, lon_coord):
    lat_lon = np.array([lat_coord, lon_coord])
    dist, pos = tree.query(lat_lon)
    return pos

NewYorkCity = (40.7128, -74.0059)
NewYorkCity_idx = nearest_site(tree, NewYorkCity[0], NewYorkCity[1] )

print("Site index for New York City: \t\t {}".format(NewYorkCity_idx))
print("Coordinates of New York City: \t {}".format(NewYorkCity))
print("Coordinates of nearest point: \t {}".format(dset_coords[NewYorkCity_idx]))
 
Site index for New York City: 		 1244690
Coordinates of New York City: 	 (40.7128, -74.0059)
Coordinates of nearest point: 	 [ 40.73 -74.02]
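cKDTree.query also accepts an array of points, so several sites resolve in one call. The grid below is a toy stand-in for f['coordinates'][...] so the sketch runs without the server:

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy lat/lon grid standing in for f['coordinates'][...]:
lats, lons = np.meshgrid(np.arange(39.0, 42.0, 0.5), np.arange(-76.0, -72.0, 0.5))
dset_coords = np.column_stack([lats.ravel(), lons.ravel()])
tree = cKDTree(dset_coords)

# query() takes an (n, 2) array and returns one distance/index pair per row:
cities = np.array([[40.7128, -74.0059],   # New York City
                   [39.9526, -75.1652]])  # Philadelphia
dist, pos = tree.query(cities)
print(dset_coords[pos])   # nearest grid point for each city
```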
In [104]:
# Get the entire 2012 timeseries data for a point in NYC
%time tseries = dset[:, NewYorkCity_idx] / dset.attrs['psm_scale_factor']
 
CPU times: user 2.99 ms, sys: 1.55 ms, total: 4.54 ms
Wall time: 1.55 s
In [105]:
len(tseries)   # 366 days (2012 is a leap year) * 24 hours * 2 samples per hour
Out[105]:
17568
In [107]:
plt.plot(time_index, tseries)
plt.ylabel("ghi")
plt.title("NYC ghi in 2012")
Out[107]:
Text(0.5, 1.0, 'NYC ghi in 2012')
 
 

GHI Statistics

In [108]:
df = pd.DataFrame({'ghi': tseries}, index=time_index)
df["year"] = df.index.year
df["month"] = df.index.month
df["day"] = df.index.day
df["hour"] = df.index.hour

agg = df.groupby(["month","hour"]).mean()
agg = agg.reset_index().pivot(index="month",columns="hour",values="ghi")
agg
Out[108]:
hour        0    1    2    3    4    5    6    7    8    9  ...    14    15    16    17    18    19    20    21    22    23
month
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 191.806452 290.774194 314.822581 328.016129 299.096774 224.919355 150.903226 48.370968 0.387097 0.000000
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 280.068966 356.431034 410.603448 429.120690 407.586207 353.517241 230.068966 113.689655 17.482759 0.000000
3 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 377.451613 488.935484 528.177419 553.048387 539.645161 462.306452 352.483871 219.112903 81.096774 1.838710
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 568.500000 680.566667 697.900000 685.133333 627.966667 561.566667 422.916667 293.516667 147.283333 25.700000
5 1.580645 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 468.274194 563.596774 581.967742 634.758065 616.290323 546.241935 431.596774 325.645161 183.983871 71.096774
6 11.500000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 602.450000 664.683333 723.700000 677.850000 606.800000 579.633333 468.983333 338.783333 200.316667 92.250000
7 9.048387 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 572.790323 699.177419 719.612903 715.532258 646.532258 604.596774 495.548387 370.564516 226.983871 93.225806
8 0.161290 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 562.032258 634.000000 671.564516 690.048387 647.725806 556.048387 448.274194 313.483871 163.693548 42.693548
9 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 468.150000 555.883333 604.966667 559.650000 553.516667 473.583333 347.800000 207.400000 74.933333 2.783333
10 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 271.209677 325.741935 397.661290 372.354839 329.112903 261.532258 186.387097 86.548387 7.645161 0.000000
11 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 270.050000 332.133333 353.333333 361.533333 321.416667 236.550000 135.333333 29.450000 0.000000 0.000000
12 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 141.564516 209.838710 242.903226 248.741935 240.516129 175.483871 85.177419 13.806452 0.000000 0.000000

12 rows × 24 columns

In [109]:
plt.imshow(agg)
plt.xlabel("Hour")
plt.ylabel("Month")
plt.title("12 x 24 Mean GHI (W/m^2)")
plt.colorbar()
Out[109]:
<matplotlib.colorbar.Colorbar at 0x127226c18>
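The 30-minute GHI series can also be aggregated to daily insolation: each sample is an average power over half an hour, so energy is the sum times 0.5 h. A local sketch with a synthetic series standing in for tseries:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the NYC tseries above (2012, 30-minute resolution):
idx = pd.date_range('2012-01-01', periods=17568, freq='30min')
rng = np.random.default_rng(0)
ghi = pd.Series(rng.uniform(0, 1000, len(idx)), index=idx)

# W/m^2 summed over 30-minute steps -> Wh/m^2; divide by 1000 for kWh/m^2:
daily_kwh = ghi.resample('D').sum() * 0.5 / 1000

print(len(daily_kwh))   # 366 days in 2012
```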