etna.datasets.TSDataset

Bases: object

TSDataset is the main class to handle your time series data.

It prepares the series for exploratory analysis, implements feature generation with Transforms, and generates future points.

Notes

TSDataset supports custom indexing and slicing through the interface TSDataset[timestamp, segment, column]. If the dataset contains NaN values at the start of the period, those timestamps are removed.

During creation, segment is cast to string type.
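The indexing interface can be pictured in plain pandas terms. The following is only a sketch that assumes the wide representation is a DataFrame with timestamps as the index and (segment, feature) pairs as columns; it does not call ETNA, and the variable names are illustrative.

```python
import pandas as pd

# Long frame: one row per (timestamp, segment) pair.
timestamps = pd.date_range("2021-06-01", periods=30, freq="D")
long_df = pd.DataFrame(
    {
        "timestamp": list(timestamps) * 2,
        "segment": ["segment_0"] * 30 + ["segment_1"] * 30,
        "target": 1.0,
    }
)

# Wide frame: timestamps as the index, (segment, feature) pairs as columns.
wide = long_df.pivot(index="timestamp", columns="segment")
wide = wide.reorder_levels([1, 0], axis=1).sort_index(axis=1)
wide.columns.names = ["segment", "feature"]

# Rough pandas counterpart of ts["2021-06-01":"2021-06-07", "segment_0", "target"].
window = wide.loc["2021-06-01":"2021-06-07", ("segment_0", "target")]
print(window)
```

Both ends of the timestamp slice are included, which is why the window holds seven daily values.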

Examples

>>> from etna.datasets import TSDataset, generate_const_df
>>> df = generate_const_df(periods=30, start_time="2021-06-01", n_segments=2, scale=1)
>>> ts = TSDataset(df, "D")
>>> ts["2021-06-01":"2021-06-07", "segment_0", "target"]
timestamp
2021-06-01    1.0
2021-06-02    1.0
2021-06-03    1.0
2021-06-04    1.0
2021-06-05    1.0
2021-06-06    1.0
2021-06-07    1.0
Freq: D, Name: (segment_0, target), dtype: float64

>>> import pandas as pd
>>> from etna.datasets import generate_ar_df
>>> pd.options.display.float_format = '{:,.2f}'.format
>>> df_to_forecast = generate_ar_df(100, start_time="2021-01-01", n_segments=1)
>>> df_regressors = generate_ar_df(120, start_time="2021-01-01", n_segments=5)
>>> df_regressors = df_regressors.pivot(index="timestamp", columns="segment").reset_index()
>>> df_regressors.columns = ["timestamp"] + [f"regressor_{i}" for i in range(5)]
>>> df_regressors["segment"] = "segment_0"
>>> tsdataset = TSDataset(df=df_to_forecast, freq="D", df_exog=df_regressors, known_future="all")
>>> tsdataset.head(5)
segment      segment_0
feature    regressor_0 regressor_1 regressor_2 regressor_3 regressor_4 target
timestamp
2021-01-01        1.62       -0.02       -0.50       -0.56        0.52   1.62
2021-01-02        1.01       -0.80       -0.81        0.38       -0.60   1.01
2021-01-03        0.48        0.47       -0.81       -1.56       -1.37   0.48
2021-01-04       -0.59        2.44       -2.21       -1.21       -0.69  -0.59
2021-01-05        0.28        0.58       -3.07       -1.45        0.77   0.28

>>> from etna.datasets import generate_hierarchical_df
>>> pd.options.display.width = 0
>>> df = generate_hierarchical_df(periods=100, n_segments=[2, 4], start_time="2021-01-01",)
>>> df, hierarchical_structure = TSDataset.to_hierarchical_dataset(df=df, level_columns=["level_0", "level_1"])
>>> tsdataset = TSDataset(df=df, freq="D", hierarchical_structure=hierarchical_structure)
>>> tsdataset.head(5)
segment    l0s0_l1s3 l0s1_l1s0 l0s1_l1s1 l0s1_l1s2
feature       target    target    target    target
timestamp
2021-01-01      2.07      1.62     -0.45     -0.40
2021-01-02      0.59      1.01      0.78      0.42
2021-01-03     -0.24      0.48      1.18     -0.14
2021-01-04     -1.12     -0.59      1.77      1.82
2021-01-05     -1.40      0.28      0.68      0.48

Init TSDataset.

Parameters:

df (DataFrame) – dataframe with timeseries in a wide or long format: DataFrameFormat; it is expected that df has feature named “target”
freq (DateOffset | str | None) –
frequency of timestamp in df, possible values:
- pandas.DateOffset object for datetime timestamp
- pandas offset aliases for datetime timestamp
- None for integer timestamp
df_exog (DataFrame | None) – dataframe with exogenous data in a wide or long format: DataFrameFormat
known_future (Literal['all'] | Sequence) – columns in df_exog[known_future] that are regressors; if “all” is given, all columns are treated as regressors
hierarchical_structure (HierarchicalStructure | None) – Structure of the levels in the hierarchy. If None, there is no hierarchical structure in the dataset.

Methods

`add_features_from_pandas`(df_update[, ...])	Update the dataset with the new columns from pandas dataframe.
`add_prediction_intervals`(prediction_intervals_df)	Add prediction intervals into dataset.
`add_target_components`(target_components_df)	Add target components into dataset.
`create_from_misaligned`(df, freq[, df_exog, ...])	Make TSDataset from misaligned data by realigning it according to inferred alignment in `df`.
`describe`([segments])	Overview of the dataset that returns a DataFrame.
`drop_features`(features[, drop_from_exog])	Drop columns with features from the dataset.
`drop_prediction_intervals`()	Drop prediction intervals from dataset.
`drop_target_components`()	Drop target components from dataset.
`fit_transform`(transforms)	Fit and apply given transforms to the data.
`get_level_dataset`(target_level)	Generate new TSDataset on target level.
`get_prediction_intervals`()	Get `pandas.DataFrame` with prediction intervals.
`get_target_components`()	Get DataFrame with target components.
`has_hierarchy`()	Check whether dataset has hierarchical structure.
`head`([n_rows])	Return the first `n_rows` rows.
`info`([segments])	Overview of the dataset that prints the result.
`inverse_transform`(transforms)	Apply inverse transform method of transforms to the data.
`isnull`()	Return dataframe with flags that show whether the corresponding element in the wide representation of data is null.
`level_names`()	Return names of the levels in the hierarchical structure.
`make_future`(future_steps[, transforms, ...])	Return new TSDataset with features extended into the future.
`plot`([n_segments, column, segments, start, ...])	Plot of random or chosen segments.
`size`()	Return size of TSDataset.
`tail`([n_rows])	Return the last `n_rows` rows.
`to_dataset`(df)	Convert pandas dataframe to wide format.
`to_flatten`(df[, features])	Return pandas DataFrame in a long format.
`to_hierarchical_dataset`(df, level_columns[, ...])	Convert pandas dataframe from long hierarchical to ETNA Dataset format.
`to_pandas`([flatten, features])	Return pandas DataFrame.
`to_torch_dataset`(make_samples[, dropna])	Convert the TSDataset to a `torch.Dataset`.
`train_test_split`([train_start, train_end, ...])	Split given df with train-test timestamp indices or size of test set.
`transform`(transforms)	Apply given transform to the data.
`tsdataset_idx_slice`([start_idx, end_idx])	Return new TSDataset with integer-location based indexing.
`update_features_from_pandas`(df_update)	Update the existing columns in the dataset with the new values from pandas dataframe.

Attributes

`current_df_exog_level`	Return current level of dataframe with exogenous data in hierarchical structure.
`current_df_level`	Return current level of dataframe in hierarchical structure.
`features`	Get list of all features across all segments in dataset.
`freq`	Return string frequency of timestamp.
`freq_offset`	Return offset frequency of timestamp.
`idx`	Shortcut for `pd.core.indexing.IndexSlice`
`known_future`	Return columns in `df_exog` that are initially regressors.
`prediction_intervals_names`	Get a tuple with prediction intervals names.
`regressors`	Get list of all regressors across all segments in dataset.
`segments`	Get list of all segments in dataset.
`target_components_names`	Get tuple with target components names.
`timestamps`	Return TSDataset timestamp index.

add_features_from_pandas(df_update: DataFrame, update_exog: bool = False, regressors: List[str] | None = None)

Update the dataset with the new columns from pandas dataframe.

Before updating columns in df, columns of df_update will be cropped by the last timestamp in df.

Parameters:

df_update (DataFrame) – Dataframe with the new columns in wide ETNA format.
update_exog (bool) – If True, update columns also in df_exog. If you wish to add new regressors in the dataset it is recommended to turn on this flag.
regressors (List[str] | None) – List of regressors in the passed dataframe.
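The cropping step can be illustrated with plain pandas. This is a simplified sketch with flat columns and made-up names (new_feature), not the ETNA implementation: rows of the update that lie after the last timestamp of the dataset are discarded before the new column is attached.

```python
import pandas as pd

index = pd.date_range("2021-06-01", periods=5, freq="D")
df = pd.DataFrame({"target": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

# The update covers 8 days, three more than the dataset itself.
update_index = pd.date_range("2021-06-01", periods=8, freq="D")
df_update = pd.DataFrame({"new_feature": range(8)}, index=update_index)

# Crop the update by the last timestamp of the dataset, then attach it.
cropped = df_update.loc[: df.index.max()]
result = df.join(cropped)
print(result)
```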

add_prediction_intervals(prediction_intervals_df: DataFrame)

Add prediction intervals into dataset.

Parameters:

prediction_intervals_df (DataFrame) – Dataframe in a wide format with prediction intervals

Raises:

ValueError: – If dataset already contains prediction intervals
ValueError: – If prediction intervals names differ between segments

add_target_components(target_components_df: DataFrame)

Add target components into dataset.

Parameters:

target_components_df (DataFrame) – Dataframe in a wide format with target components

Raises:

ValueError: – If dataset already contains target components
ValueError: – If target components names differ between segments
ValueError: – If components don’t sum up to target
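The last condition can be checked by hand before calling the method. The sketch below is plain pandas and NumPy under stated assumptions: the component column names (target_component_a, target_component_b) and the helper components_sum_to_target are hypothetical, and the tolerance is whatever np.allclose uses by default.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2021-06-01", periods=3, freq="D")
target = pd.DataFrame(
    {("segment_0", "target"): [3.0, 5.0, 7.0], ("segment_1", "target"): [1.0, 1.0, 1.0]},
    index=index,
)
components = pd.DataFrame(
    {
        ("segment_0", "target_component_a"): [1.0, 2.0, 3.0],
        ("segment_0", "target_component_b"): [2.0, 3.0, 4.0],
        ("segment_1", "target_component_a"): [0.5, 0.5, 0.5],
        ("segment_1", "target_component_b"): [0.5, 0.5, 0.5],
    },
    index=index,
)


def components_sum_to_target(target: pd.DataFrame, components: pd.DataFrame) -> bool:
    """Check per segment that the component columns add up to the target column."""
    for segment in target.columns.get_level_values(0).unique():
        component_sum = components[segment].sum(axis=1)
        if not np.allclose(component_sum, target[(segment, "target")]):
            return False
    return True


ok = components_sum_to_target(target, components)
print(ok)
```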

classmethod create_from_misaligned(df: DataFrame, freq: DateOffset | str | None, df_exog: DataFrame | None = None, known_future: Literal['all'] | Sequence = (), future_steps: int = 1, original_timestamp_name: str = 'external_timestamp') → TSDataset

Make TSDataset from misaligned data by realigning it according to inferred alignment in df.

This method:

- Infers alignment using infer_alignment();
- Realigns df and df_exog according to the inferred alignment using apply_alignment();
- Creates an exog feature with the original timestamp using make_timestamp_df_from_alignment();
- Creates TSDataset from these data.

This method doesn’t work with hierarchical_structure, because it doesn’t make much sense.

Parameters:

df (DataFrame) – dataframe with timeseries in a long format: DataFrameFormat; it is expected that df has feature named “target”
freq (DateOffset | str | None) –
frequency of timestamp in df, possible values:
- pandas.DateOffset object for datetime timestamp
- pandas offset aliases for datetime timestamp
- None for integer timestamp
df_exog (DataFrame | None) – dataframe with exogenous data in a long format: DataFrameFormat
known_future (Literal['all'] | Sequence) – columns in df_exog[known_future] that are regressors; if “all” is given, all columns are treated as regressors
future_steps (int) – determines on how many steps original timestamp should be extended into the future before adding into df_exog; expected to be positive
original_timestamp_name (str) – name for original timestamp column to add it into df_exog

Returns:

Created TSDataset.

Raises:

ValueError: – If future_steps is not positive.
ValueError: – If original_timestamp_name intersects with columns in df_exog.
ValueError: – If df isn’t in a long format.
ValueError: – If df_exog is set and isn’t in a long format.

Return type:

TSDataset

describe(segments: Sequence[str] | None = None) → DataFrame

Overview of the dataset that returns a DataFrame.

Method describes dataset in segment-wise fashion. Description columns:

start_timestamp: beginning of the segment, missing values in the beginning are ignored
end_timestamp: ending of the dataset, common for all segments
length: length according to start_timestamp and end_timestamp
num_missing: number of missing values between start_timestamp and end_timestamp
num_segments: total number of segments, common for all segments
num_exogs: number of exogenous features, common for all segments
num_regressors: number of exogenous features that are regressors, common for all segments
num_known_future: number of regressors that are known since creation, common for all segments
freq: frequency of the series, common for all segments

Parameters:: segments (Sequence[str] | None) – segments to show in overview, if None all segments are shown.
Returns:: result_table – table with results of the overview
Return type:: pd.DataFrame
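The per-segment columns can be reproduced for a single series with plain pandas. This is a sketch of the definitions listed above, not the code ETNA runs; the series values are made up.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2021-06-01", periods=10, freq="D")
# Two leading NaNs (ignored) and one NaN in the middle (counted as missing).
series = pd.Series(
    [np.nan, np.nan, 1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0], index=index
)

start_timestamp = series.first_valid_index()
end_timestamp = series.index.max()
observed = series.loc[start_timestamp:end_timestamp]
length = len(observed)
num_missing = int(observed.isna().sum())
print(start_timestamp, end_timestamp, length, num_missing)
```

Leading NaNs shift start_timestamp forward, so they reduce length instead of adding to num_missing.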

Examples

>>> from etna.datasets import generate_const_df
>>> pd.options.display.expand_frame_repr = False
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> regressors_timestamp = pd.date_range(start="2021-06-01", periods=50)
>>> df_regressors_1 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 1, "segment": "segment_0"}
... )
>>> df_regressors_2 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 2, "segment": "segment_1"}
... )
>>> df_exog = pd.concat([df_regressors_1, df_regressors_2], ignore_index=True)
>>> ts = TSDataset(df, df_exog=df_exog, freq="D", known_future="all")
>>> ts.describe()
          start_timestamp end_timestamp  length  num_missing  num_segments  num_exogs  num_regressors  num_known_future freq
segments
segment_0      2021-06-01    2021-06-30      30            0             2          1               1                 1    D
segment_1      2021-06-01    2021-06-30      30            0             2          1               1                 1    D

drop_features(features: List[str], drop_from_exog: bool = False)

Drop columns with features from the dataset.

Parameters:

features (List[str]) – List of features to drop.
drop_from_exog (bool) –
- If False, drop features only from df. Features will appear again in df after make_future.
- If True, drop features from df and df_exog. Features won’t appear in df after make_future.

Raises:

ValueError: – If features list contains target or target components

drop_prediction_intervals(): Drop prediction intervals from dataset.

drop_target_components(): Drop target components from dataset.

fit_transform(transforms: Sequence[Transform])

Fit and apply given transforms to the data.

Parameters:: transforms (Sequence[Transform]) – sequence of transforms to fit and apply.

get_level_dataset(target_level: str) → TSDataset

Generate new TSDataset on target level.

Parameters:: target_level (str) – target level name
Returns:: generated dataset
Return type:: TSDataset

get_prediction_intervals() → DataFrame | None

Get pandas.DataFrame with prediction intervals.

Returns:: pandas.DataFrame with prediction intervals for target variable.
Return type:: DataFrame | None

get_target_components() → DataFrame | None

Get DataFrame with target components.

Returns:: Dataframe with target components
Return type:: DataFrame | None

has_hierarchy() → bool

Check whether dataset has hierarchical structure.

Return type:: bool

head(n_rows: int = 5) → DataFrame

Return the first n_rows rows.

Mimics pandas method.

This function returns the first n_rows rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n_rows, this function returns all rows except the last |n_rows| rows, equivalent to df[:n_rows].

Parameters:: n_rows (int) – number of rows to select.
Returns:: the first n_rows rows or 5 by default.
Return type:: pd.DataFrame
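Since the method mimics pandas, the behaviour for positive and negative n_rows can be previewed on an ordinary DataFrame. This is a pandas-only sketch with a made-up frame; it does not construct a TSDataset.

```python
import pandas as pd

df = pd.DataFrame({"target": range(10)})

first_three = df.head(3)          # rows 0, 1, 2
all_but_last_three = df.head(-3)  # rows 0..6, the last three are dropped
print(first_three)
print(all_but_last_three)
```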

info(segments: Sequence[str] | None = None) → None

Overview of the dataset that prints the result.

Method describes dataset in segment-wise fashion.

Information about dataset in general:

num_segments: total number of segments
num_exogs: number of exogenous features
num_regressors: number of exogenous features that are regressors
num_known_future: number of regressors that are known since creation
freq: frequency of the dataset
end_timestamp: ending of the dataset

Information about individual segments:

start_timestamp: beginning of the segment, missing values in the beginning are ignored
length: length according to start_timestamp and end_timestamp
num_missing: number of missing values between start_timestamp and end_timestamp

Parameters:: segments (Sequence[str] | None) – segments to show in overview, if None all segments are shown.
Return type:: None

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> regressors_timestamp = pd.date_range(start="2021-06-01", periods=50)
>>> df_regressors_1 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 1, "segment": "segment_0"}
... )
>>> df_regressors_2 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 2, "segment": "segment_1"}
... )
>>> df_exog = pd.concat([df_regressors_1, df_regressors_2], ignore_index=True)
>>> ts = TSDataset(df, df_exog=df_exog, freq="D", known_future="all")
>>> ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 2
num_exogs: 1
num_regressors: 1
num_known_future: 1
freq: D
end_timestamp: 2021-06-30 00:00:00
          start_timestamp  length  num_missing
segments
segment_0      2021-06-01      30            0
segment_1      2021-06-01      30            0

inverse_transform(transforms: Sequence[Transform])

Apply inverse transform method of transforms to the data.

Applied in reversed order.

Parameters:: transforms (Sequence[Transform]) – sequence of transforms whose inverse transformations are applied.
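Why the order is reversed can be seen on a toy example. The sketch below uses hypothetical (forward, inverse) function pairs instead of real ETNA Transform objects: undoing the last applied step first is what recovers the original value.

```python
from typing import Callable, List, Tuple

# Each hypothetical transform is a (forward, inverse) pair of functions.
TransformPair = Tuple[Callable[[float], float], Callable[[float], float]]

add_ten: TransformPair = (lambda x: x + 10.0, lambda x: x - 10.0)
double: TransformPair = (lambda x: x * 2.0, lambda x: x / 2.0)
transforms: List[TransformPair] = [add_ten, double]


def apply_all(value: float, transforms: List[TransformPair]) -> float:
    """Apply forward functions in the given order."""
    for forward, _ in transforms:
        value = forward(value)
    return value


def invert_all(value: float, transforms: List[TransformPair]) -> float:
    """Apply inverse functions in reversed order."""
    for _, inverse in reversed(transforms):
        value = inverse(value)
    return value


transformed = apply_all(1.0, transforms)        # (1 + 10) * 2
restored = invert_all(transformed, transforms)  # 22 / 2 - 10
print(transformed, restored)
```

Inverting in the original order would compute (22 - 10) / 2 = 6 and fail to restore the input.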

isnull() → DataFrame

Return dataframe with flags that show whether the corresponding element in the wide representation of data is null.

Wide representation could be obtained by using self.to_pandas().

Returns:: is_null dataframe
Return type:: pd.DataFrame

level_names() → List[str] | None

Return names of the levels in the hierarchical structure.

Return type:: List[str] | None

make_future(future_steps: int, transforms: Sequence[Transform] = (), tail_steps: int = 0) → TSDataset

Return new TSDataset with features extended into the future.

Notes

The result dataset doesn’t contain prediction intervals and target components. Some columns and modifications may be lost if a transformed dataset is used to make future. This behavior is due to the usage of an initial state of the dataset to compute the future.

Parameters:

future_steps (int) – number of steps to extend dataset into the future.
transforms (Sequence[Transform]) – sequence of transforms to be applied.
tail_steps (int) – number of steps to keep from the tail of the original dataset.

Returns:

dataset with features extended into the future.

Return type:

TSDataset

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> df_regressors = pd.DataFrame({
...     "timestamp": list(pd.date_range("2021-06-01", periods=40))*2,
...     "regressor_1": np.arange(80), "regressor_2": np.arange(80) + 5,
...     "segment": ["segment_0"]*40 + ["segment_1"]*40
... })
>>> ts = TSDataset(
...     df, "D", df_exog=df_regressors, known_future="all"
... )
>>> ts.make_future(4)
segment      segment_0                      segment_1
feature    regressor_1 regressor_2 target regressor_1 regressor_2 target
timestamp
2021-07-01          30          35    NaN          70          75    NaN
2021-07-02          31          36    NaN          71          76    NaN
2021-07-03          32          37    NaN          72          77    NaN
2021-07-04          33          38    NaN          73          78    NaN
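The shape of this output can be approximated for a single segment in plain pandas. The sketch below only illustrates the idea (extend the timestamp index, take regressor values that are already known, leave the target empty); it is not how ETNA builds the future dataset, and it ignores transforms and tail_steps.

```python
import numpy as np
import pandas as pd

history_index = pd.date_range("2021-06-01", periods=30, freq="D")
exog_index = pd.date_range("2021-06-01", periods=40, freq="D")

target = pd.Series(1.0, index=history_index, name="target")
regressor_1 = pd.Series(np.arange(40), index=exog_index, name="regressor_1")

# Timestamps that follow the last observed one.
future_steps = 4
future_index = pd.date_range(
    start=history_index[-1] + pd.Timedelta(days=1), periods=future_steps, freq="D"
)

# Regressors are known in advance; the target is not, so it becomes NaN.
future = pd.DataFrame(
    {
        "regressor_1": regressor_1.reindex(future_index),
        "target": target.reindex(future_index),
    }
)
print(future)
```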

plot(n_segments: int = 10, column: str = 'target', segments: Sequence[str] | None = None, start: Timestamp | int | str | None = None, end: Timestamp | int | str | None = None, seed: int = 1, figsize: Tuple[int, int] = (10, 5))

Plot of random or chosen segments.

Parameters:

n_segments (int) – number of random segments to plot
column (str) – feature to plot
segments (Sequence[str] | None) – segments to plot
seed (int) – seed for local random state
start (Timestamp | int | str | None) – start plot from this timestamp
end (Timestamp | int | str | None) – end plot at this timestamp
figsize (Tuple[int, int]) – size of the figure per subplot with one segment in inches

Raises:

ValueError: – Incorrect type of start or end is used according to freq

size() → Tuple[int, int, int | None]

Return size of TSDataset.

The order of sizes is (number of timestamps, number of segments, number of features). The number of features is None if it differs between segments.

Returns:: Tuple of TSDataset sizes
Return type:: Tuple[int, int, int | None]

tail(n_rows: int = 5) → DataFrame

Return the last n_rows rows.

Mimics pandas method.

This function returns last n_rows rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

For negative values of n_rows, this function returns all rows except the first |n_rows| rows, equivalent to df[abs(n_rows):].

Parameters:: n_rows (int) – number of rows to select.
Returns:: the last n_rows rows or 5 by default.
Return type:: pd.DataFrame
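As with head, the pandas behaviour that this method mimics can be checked on an ordinary DataFrame. This is a pandas-only sketch with a made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"target": range(10)})

last_three = df.tail(3)            # rows 7, 8, 9
all_but_first_three = df.tail(-3)  # rows 3..9, the first three are dropped
print(last_three)
print(all_but_first_three)
```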

static to_dataset(df: DataFrame) → DataFrame

Convert pandas dataframe to wide format.

Columns “timestamp” and “segment” are required.

Parameters:: df (DataFrame) – DataFrame with columns [“timestamp”, “segment”]. Other columns are considered features. Column “timestamp” is expected to be one of two types: integer or timestamp.
Return type:: DataFrame

Notes

During conversion, segment is cast to string type.

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> df.head(5)
   timestamp    segment  target
0 2021-06-01  segment_0    1.00
1 2021-06-02  segment_0    1.00
2 2021-06-03  segment_0    1.00
3 2021-06-04  segment_0    1.00
4 2021-06-05  segment_0    1.00
>>> df_wide = TSDataset.to_dataset(df)
>>> df_wide.head(5)
segment    segment_0 segment_1
feature       target    target
timestamp
2021-06-01      1.00      1.00
2021-06-02      1.00      1.00
2021-06-03      1.00      1.00
2021-06-04      1.00      1.00
2021-06-05      1.00      1.00

>>> df_regressors = pd.DataFrame({
...     "timestamp": pd.date_range("2021-01-01", periods=10),
...     "regressor_1": np.arange(10), "regressor_2": np.arange(10) + 5,
...     "segment": ["segment_0"]*10
... })
>>> TSDataset.to_dataset(df_regressors).head(5)
segment      segment_0
feature    regressor_1 regressor_2
timestamp
2021-01-01           0           5
2021-01-02           1           6
2021-01-03           2           7
2021-01-04           3           8
2021-01-05           4           9

static to_flatten(df: DataFrame, features: Literal['all'] | Sequence[str] = 'all') → DataFrame

Return pandas DataFrame in a long format.

The order of columns is (timestamp, segment, target, features in alphabetical order).

Parameters:

df (DataFrame) – DataFrame in ETNA format.
features (Literal['all'] | Sequence[str]) – List of features to return. If “all”, return all the features in the dataset. Columns with timestamp and segment are always returned.

Returns:

dataframe with TSDataset data

Return type:

pd.DataFrame

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> df.head(5)
    timestamp    segment  target
0  2021-06-01  segment_0    1.00
1  2021-06-02  segment_0    1.00
2  2021-06-03  segment_0    1.00
3  2021-06-04  segment_0    1.00
4  2021-06-05  segment_0    1.00
>>> df_wide = TSDataset.to_dataset(df)
>>> TSDataset.to_flatten(df_wide).head(5)
   timestamp    segment  target
0 2021-06-01  segment_0    1.0
1 2021-06-02  segment_0    1.0
2 2021-06-03  segment_0    1.0
3 2021-06-04  segment_0    1.0
4 2021-06-05  segment_0    1.0
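The column order of the long format matters once extra features are present. The following pandas-only sketch imitates the documented order (timestamp, segment, target, then the remaining features alphabetically) on a small wide frame; the feature names regressor_a and regressor_b are made up, and this is not the ETNA implementation.

```python
import pandas as pd

index = pd.date_range("2021-06-01", periods=3, freq="D", name="timestamp")
wide = pd.DataFrame(
    {
        ("segment_0", "target"): [1.0, 1.0, 1.0],
        ("segment_0", "regressor_b"): [10, 11, 12],
        ("segment_0", "regressor_a"): [20, 21, 22],
        ("segment_1", "target"): [2.0, 2.0, 2.0],
        ("segment_1", "regressor_b"): [30, 31, 32],
        ("segment_1", "regressor_a"): [40, 41, 42],
    },
    index=index,
)
wide.columns.names = ["segment", "feature"]
wide = wide.sort_index(axis=1)

# Stack the per-segment blocks on top of each other.
blocks = []
for segment in wide.columns.get_level_values("segment").unique():
    block = wide[segment].reset_index()
    block["segment"] = segment
    blocks.append(block)
long_df = pd.concat(blocks, ignore_index=True)

# timestamp, segment, target first; remaining features alphabetically.
leading = ["timestamp", "segment", "target"]
other = sorted(column for column in long_df.columns if column not in leading)
long_df = long_df[leading + other]
print(long_df)
```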

static to_hierarchical_dataset(df: DataFrame, level_columns: List[str], keep_level_columns: bool = False, sep: str = '_', return_hierarchy: bool = True) → Tuple[DataFrame, HierarchicalStructure | None]

Convert pandas dataframe from long hierarchical to ETNA Dataset format.

Parameters:

df (DataFrame) – Dataframe in long hierarchical format with columns [timestamp, target] + [level_columns] + [other_columns]
level_columns (List[str]) – Columns of the dataframe that define the levels in the hierarchy, in order from top to bottom, i.e. [level_name_1, level_name_2, …]. Names of the columns will be used as names of the levels in the hierarchy.
keep_level_columns (bool) – If True, leave the level columns in the result dataframe. By default, level columns are concatenated into the “segment” column and dropped.
sep (str) – String used to concatenate the level names
return_hierarchy (bool) – If True, return the hierarchical structure

Returns:

Dataframe in wide format and optionally hierarchical structure

Raises:

ValueError – If level_columns is empty

Return type:

Tuple[DataFrame, HierarchicalStructure | None]
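How the level columns turn into segment names can be sketched with plain pandas. The frame below is made up, and the snippet only shows the concatenation with sep; it does not build the HierarchicalStructure or the wide frame.

```python
import pandas as pd

long_hierarchical = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2021-01-01"] * 4),
        "level_0": ["l0s0", "l0s0", "l0s1", "l0s1"],
        "level_1": ["l1s0", "l1s1", "l1s2", "l1s3"],
        "target": [1.0, 2.0, 3.0, 4.0],
    }
)
level_columns = ["level_0", "level_1"]
sep = "_"

# Join the level values row by row to get one segment name per series.
segment = long_hierarchical[level_columns].astype(str).agg(sep.join, axis=1)
print(segment.tolist())
```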

to_pandas(flatten: bool = False, features: Literal['all'] | Sequence[str] = 'all') → DataFrame

Return pandas DataFrame.

Parameters:

flatten (bool) –
- If False, return dataframe in a wide format
- If True, return dataframe in a long format, its order of columns is (timestamp, segment, target, features in alphabetical order).
features (Literal['all'] | Sequence[str]) – List of features to return. If “all”, return all the features in the dataset.

Returns:

dataframe with TSDataset data

Return type:

pd.DataFrame

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> df.head(5)
    timestamp    segment  target
0  2021-06-01  segment_0    1.00
1  2021-06-02  segment_0    1.00
2  2021-06-03  segment_0    1.00
3  2021-06-04  segment_0    1.00
4  2021-06-05  segment_0    1.00
>>> ts = TSDataset(df, "D")
>>> ts.to_pandas(True).head(5)
    timestamp    segment  target
0  2021-06-01  segment_0    1.00
1  2021-06-02  segment_0    1.00
2  2021-06-03  segment_0    1.00
3  2021-06-04  segment_0    1.00
4  2021-06-05  segment_0    1.00
>>> ts.to_pandas(False).head(5)
segment    segment_0 segment_1
feature       target    target
timestamp
2021-06-01      1.00      1.00
2021-06-02      1.00      1.00
2021-06-03      1.00      1.00
2021-06-04      1.00      1.00
2021-06-05      1.00      1.00

to_torch_dataset(make_samples: Callable[[DataFrame], Iterator[dict] | Iterable[dict]], dropna: bool = True) → Dataset

Convert the TSDataset to a torch.Dataset.

Parameters:

make_samples (Callable[[DataFrame], Iterator[dict] | Iterable[dict]]) – function that takes a per-segment DataFrame and returns an iterable of samples
dropna (bool) – if True, missing rows are dropped

Returns:

torch.Dataset with train or test samples to infer on

Return type:

Dataset
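A make_samples callable could look like the generator below. This is a sketch under assumptions: the per-segment frame is assumed to contain a “target” column, the window length and the dictionary keys (inputs, label) are arbitrary choices for illustration, and torch itself is not used here.

```python
from typing import Dict, Iterator

import numpy as np
import pandas as pd


def make_samples(df: pd.DataFrame, window: int = 3) -> Iterator[Dict[str, np.ndarray]]:
    """Yield sliding windows: `window` past values as input, the next value as label."""
    values = df["target"].to_numpy()
    for start in range(len(values) - window):
        yield {
            "inputs": values[start : start + window],
            "label": values[start + window : start + window + 1],
        }


# Stand-in for the frame of one segment.
segment_df = pd.DataFrame({"target": [1.0, 2.0, 3.0, 4.0, 5.0]})
samples = list(make_samples(segment_df))
print(samples)
```

A function of this shape matches the expected signature because it accepts a DataFrame and returns an iterator of dictionaries.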

train_test_split(train_start: Timestamp | int | str | None = None, train_end: Timestamp | int | str | None = None, test_start: Timestamp | int | str | None = None, test_end: Timestamp | int | str | None = None, test_size: int | None = None) → Tuple[TSDataset, TSDataset]

Split given df with train-test timestamp indices or size of test set.

In case of inconsistencies between test_size and (test_start, test_end), test_size is ignored.

During splitting all the features are kept in train and test parts including target, regressors, target components, prediction intervals.

Parameters:

train_start (Timestamp | int | str | None) – start timestamp of new train dataset, if None first timestamp is used
train_end (Timestamp | int | str | None) – end timestamp of new train dataset, if None previous to test_start timestamp is used
test_start (Timestamp | int | str | None) – start timestamp of new test dataset, if None next to train_end timestamp is used
test_end (Timestamp | int | str | None) – end timestamp of new test dataset, if None last timestamp is used
test_size (int | None) – number of timestamps to use in test set

Returns:

train, test – generated datasets

Return type:

Tuple[TSDataset, TSDataset]

Raises:

ValueError: – Incorrect type of train_start or train_end or test_start or test_end is used according to ts.freq

Examples

>>> from etna.datasets import generate_ar_df
>>> pd.options.display.float_format = '{:,.2f}'.format
>>> df = generate_ar_df(100, start_time="2021-01-01", n_segments=3)
>>> ts = TSDataset(df, "D")
>>> train_ts, test_ts = ts.train_test_split(
...     train_start="2021-01-01", train_end="2021-02-01",
...     test_start="2021-02-02", test_end="2021-02-07"
... )
>>> train_ts.tail(5)
segment    segment_0 segment_1 segment_2
feature       target    target    target
timestamp
2021-01-28     -2.06      2.03      1.51
2021-01-29     -2.33      0.83      0.81
2021-01-30     -1.80      1.69      0.61
2021-01-31     -2.49      1.51      0.85
2021-02-01     -2.89      0.91      1.06
>>> test_ts.head(5)
segment    segment_0 segment_1 segment_2
feature       target    target    target
timestamp
2021-02-02     -3.57     -0.32      1.72
2021-02-03     -4.42      0.23      3.51
2021-02-04     -5.09      1.02      3.39
2021-02-05     -5.10      0.40      2.15
2021-02-06     -6.22      0.92      0.97
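Splitting by test_size instead of explicit timestamps amounts to cutting off the last rows. The pandas-only sketch below shows that idea on a single column and derives the boundary timestamps that would describe the same split; it is an illustration, not the ETNA implementation.

```python
import pandas as pd

index = pd.date_range("2021-01-01", periods=100, freq="D")
df = pd.DataFrame({"target": range(100)}, index=index)

# The last `test_size` timestamps form the test part.
test_size = 7
train = df.iloc[:-test_size]
test = df.iloc[-test_size:]

# The same split expressed through explicit boundary timestamps.
train_end = index[-test_size - 1]
test_start = index[-test_size]
print(train_end, test_start)
```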

transform(transforms: Sequence[Transform])

Apply given transforms to the data.

Parameters:: transforms (Sequence[Transform]) – sequence of transforms to apply.

tsdataset_idx_slice(start_idx: int | None = None, end_idx: int | None = None) → TSDataset

Return new TSDataset with integer-location based indexing.

Parameters:

start_idx (int | None) – starting integer index (counting from 0) of the slice.
end_idx (int | None) – last integer index (counting from 0) of the slice.

Returns:

TSDataset based on indexing slice.

Return type:

TSDataset

update_features_from_pandas(df_update: DataFrame)

Update the existing columns in the dataset with the new values from pandas dataframe.

Before updating columns in df, columns of df_update will be cropped by the last timestamp in df. Columns in df_exog are not updated. If you wish to update df_exog, create a new instance of TSDataset.

Updating df with df_update with different corresponding column dtypes could lead to unexpected behaviour in different pandas versions.

Parameters:

df_update (DataFrame) – Dataframe with new values in wide ETNA format.

Raises:

ValueError: – If timestamps do not match
ValueError: – If there are columns in the update dataframe that are not present in the dataset
ValueError: – If there are duplicate features in the dataset (columns with the same name)

property current_df_exog_level: str | None

Return current level of dataframe with exogenous data in hierarchical structure.

Returns:: Level of dataframe with exogenous data
Return type:: str or None

property current_df_level: str | None

Return current level of dataframe in hierarchical structure.

Returns:: Level of dataframe
Return type:: str or None

property features: List[str]

Get list of all features across all segments in dataset.

All features include initial exogenous data, generated features, target, target components, prediction intervals. The order of features in returned list isn’t specified.

Returns:: List of features.

property freq: str | None

Return string frequency of timestamp.

Returns:: String frequency of timestamp.
Return type:: str or None

property freq_offset: DateOffset | None

Return offset frequency of timestamp.

Returns:: Offset frequency of timestamp.
Return type:: BaseOffset or None

idx = <pandas.core.indexing._IndexSlice object>: Shortcut for pd.core.indexing.IndexSlice

property known_future: List[str]

Return columns in df_exog that are initially regressors.

Returns:: List of regressor columns

property prediction_intervals_names: Tuple[str, ...]: Get a tuple with prediction intervals names. Returns an empty tuple if there are no prediction intervals.

property regressors: List[str]

Get list of all regressors across all segments in dataset.

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> regressors_timestamp = pd.date_range(start="2021-06-01", periods=50)
>>> df_regressors_1 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 1, "segment": "segment_0"}
... )
>>> df_regressors_2 = pd.DataFrame(
...     {"timestamp": regressors_timestamp, "regressor_1": 2, "segment": "segment_1"}
... )
>>> df_exog = pd.concat([df_regressors_1, df_regressors_2], ignore_index=True)
>>> ts = TSDataset(
...     df, df_exog=df_exog, freq="D", known_future="all"
... )
>>> ts.regressors
['regressor_1']

property segments: List[str]

Get list of all segments in dataset.

Examples

>>> from etna.datasets import generate_const_df
>>> df = generate_const_df(
...    periods=30, start_time="2021-06-01",
...    n_segments=2, scale=1
... )
>>> ts = TSDataset(df, "D")
>>> ts.segments
['segment_0', 'segment_1']

property target_components_names: Tuple[str, ...]: Get a tuple with target components names. Components sum up to target. Returns an empty tuple if there are no components.

property timestamps: Index

Return TSDataset timestamp index.

Returns:: timestamp index of TSDataset