statsmodels provides datasets (i.e., data and metadata) for use in examples, tutorials, model testing, etc.
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. The actual data is accessible by the data attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
In [4]: duncan_prestige.data.head(5)
get_rdataset(dataname[, package, cache]) | Download and return an R dataset. |
get_data_home([data_home]) | Return the path of the statsmodels data directory. |
clear_data_home([data_home]) | Delete all the content of the data home cache. |
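As a rough illustration of the two cache helpers listed above, the sketch below points them at a throwaway directory instead of the default location. The scratch path and the variable names are assumptions made for this example. It relies on get_data_home creating the directory when it is missing and on clear_data_home removing it.

```python
import os
import shutil
import tempfile

from statsmodels.datasets import clear_data_home, get_data_home

# Hypothetical scratch location, so the real cache is left untouched.
scratch = tempfile.mkdtemp()
cache_dir = os.path.join(scratch, "statsmodels_data")

# get_data_home returns the cache path and creates the folder if needed.
returned_path = get_data_home(cache_dir)
existed_after_get = os.path.isdir(returned_path)

# clear_data_home deletes the cache folder together with its contents.
clear_data_home(cache_dir)
exists_after_clear = os.path.exists(cache_dir)

# Remove the scratch directory created for this example.
shutil.rmtree(scratch, ignore_errors=True)

print(existed_after_get, exists_after_clear)
```

Passing an explicit directory keeps the example from touching any datasets already cached in the default data home.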
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the dataset proposal. The full dataset is available in the data attribute.
In [7]: data.data
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
In [9]: data.exog[:5,:]
Univariate datasets, however, do not have an exog attribute.
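A minimal sketch of the endog/exog split for the Longley data follows. It is written so that it should not depend on whether load returns NumPy or pandas containers in the installed version, which is why np.asarray is applied first. The variable names are illustrative.

```python
import numpy as np
from statsmodels.datasets import longley

dataset = longley.load()

# Convert explicitly so the slicing below works for arrays and pandas objects.
endog = np.asarray(dataset.endog)
exog = np.asarray(dataset.exog)

print(endog.shape)   # one response value per observation
print(exog.shape)    # one row per observation, one column per regressor
print(exog[:5, :])   # first five rows of the regressors
```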
Variable names can be obtained by typing:
In [10]: data.endog_name
In [11]: data.exog_name
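The name attributes can be inspected in the same way. This short sketch uses load_pandas on the Longley data and only reads endog_name and exog_name; wrapping exog_name in list is a precaution, not a requirement stated by the documentation.

```python
from statsmodels.datasets import longley

dataset = longley.load_pandas()

endog_name = dataset.endog_name        # name of the response variable
exog_names = list(dataset.exog_name)   # names of the regressors

print(endog_name)
print(exog_names)
```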
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [12]: type(data.data)
In [13]: type(data.raw_data)
In [14]: data.names
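For the macrodata case described above, a sketch along the following lines shows that the full table is still reachable even though no endog/exog split is defined. It uses load_pandas for convenience, and the getattr fallback is a defensive assumption about how a missing endog is exposed.

```python
from statsmodels.datasets import macrodata

dataset = macrodata.load_pandas()
frame = dataset.data

column_names = list(frame.columns)
n_obs = len(frame)

# No response/regressor split is defined for this collection, so the
# lookup falls back to None (getattr is used defensively here).
endog = getattr(dataset, "endog", None)

print(n_obs, column_names[:5])
print(endog)
```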
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each dataset module provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
In [17]: data.endog
The full DataFrame is available in the data attribute of the Dataset object:
In [18]: data.data
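To make the pandas containers concrete, the following sketch checks the types returned by load_pandas for the Longley data and how the response and regressor columns relate to the full frame. The variable names are illustrative.

```python
import pandas as pd
from statsmodels.datasets import longley

dataset = longley.load_pandas()

endog_is_series = isinstance(dataset.endog, pd.Series)
exog_is_frame = isinstance(dataset.exog, pd.DataFrame)

# The full frame holds the response column plus every regressor column.
all_columns = list(dataset.data.columns)
split_columns = [dataset.endog.name] + list(dataset.exog.columns)

print(endog_is_series, exog_is_frame)
print(dataset.data.shape)
print(sorted(all_columns) == sorted(split_columns))
```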
With pandas integration in the estimation classes, metadata such as variable names will be attached to model results.
If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
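These module-level attributes are plain strings, so they can be printed or searched directly. The sketch below collects a few of them for the Longley module and shows the first line of each; the metadata dictionary and the loop are illustrative helpers, not part of statsmodels.

```python
from statsmodels.datasets import longley

# Gather a few of the descriptive strings defined on the dataset module.
metadata = {
    name: getattr(longley, name)
    for name in ("TITLE", "DESCRSHORT", "SOURCE", "NOTE")
}

for name, text in metadata.items():
    lines = text.strip().splitlines()
    first_line = lines[0] if lines else ""
    print("{}: {}".format(name, first_line))
```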