
PyMFE to Torch Module

This module provides a PyTorch implementation of Meta-Feature Extraction (MFE) for statistical analysis of tabular data.

Overview

The PyMFE to Torch module re-implements meta-feature extraction as PyTorch tensor operations, enabling differentiable computation of statistical properties during GAN training. This is essential for the Meta-Feature Statistics (MFS) preservation component of the WGAN-GP implementation.

Key Features

Statistical Meta-Features

  • Correlation: Pearson correlation coefficients between features
  • Covariance: Covariance matrix computation
  • Eigenvalues: Principal component eigenvalues for dimensionality analysis
  • Distributional Statistics: Mean, variance, standard deviation, range, min, max
  • Advanced Statistics: Skewness, kurtosis, interquartile range, sparsity

PyTorch Integration

  • Differentiable Operations: computations maintain gradient flow wherever the underlying tensor ops are differentiable (sparsity, which relies on torch.unique, is the exception; see the sketch after this list)
  • GPU Acceleration: CUDA-compatible tensor operations
  • Batch Processing: Efficient computation over data batches
  • Device Management: Automatic device placement for tensors
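
Because the extractors are ordinary tensor operations, meta-features can sit directly inside a loss term. A minimal sketch of the gradient flow and device handling (assuming the import path shown on this page, and avoiding the non-differentiable sparsity feature):

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch  # import path assumed from this page

extractor = MFEToTorch()
extractor.change_device("cuda" if torch.cuda.is_available() else "cpu")

# A batch of 128 samples with 8 features; requires_grad mimics generator output.
X = torch.randn(128, 8, requires_grad=True, device=extractor.device)

mfs = extractor.get_mfs(X, None, subset=["mean", "var"])
mfs.sum().backward()   # gradients flow back through the meta-features
print(X.grad.shape)    # torch.Size([128, 8])
```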

MFEToTorch Class

The main class that provides:

  • Feature method mapping for easy access to statistical functions
  • Torch-native implementations of traditional meta-feature extraction
  • Integration with the training loop for real-time MFS computation
  • Support for subset feature selection for targeted preservation

Usage in Training

This module is crucial for the MFS-enhanced WGAN-GP training, where it:

  1. Computes meta-features for batches of real data
  2. Calculates the corresponding features for generated synthetic data
  3. Enables Wasserstein distance computation between feature distributions
  4. Provides gradients for generator optimization
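
A hedged sketch of such a loss term follows; the exact distance and weighting used in training are defined elsewhere in the project, so `lambda_mfs`, `critic`, and the batches here are illustrative placeholders:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

extractor = MFEToTorch()
subset = ["mean", "var", "cor"]  # illustrative choice of preserved statistics

def mfs_penalty(real_batch: torch.Tensor, fake_batch: torch.Tensor) -> torch.Tensor:
    """One possible MFS term: distance between real and synthetic meta-features."""
    real_mfs = extractor.get_mfs(real_batch, None, subset)
    fake_mfs = extractor.get_mfs(fake_batch, None, subset)
    return torch.norm(real_mfs - fake_mfs, p=2)

# Inside a hypothetical generator step:
# g_loss = -critic(fake_batch).mean() + lambda_mfs * mfs_penalty(real_batch, fake_batch)
```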

wgan_gp.pymfe_to_torch

MFEToTorch

A class to compute meta-features using PyTorch.

This class provides methods to calculate various meta-features for a given dataset using PyTorch tensors. It includes functionalities for computing statistical measures, correlation, covariance, and other properties of the data.

Meta-Feature Statistics (MFS) Available:

| Feature Name | Method | Description |
|--------------|--------|-------------|
| cor | ft_cor_torch | Correlation matrix (absolute values of lower triangle) |
| cov | ft_cov_torch | Covariance matrix (absolute values of lower triangle) |
| eigenvalues | ft_eigenvals | Eigenvalues of the covariance matrix |
| iq_range | ft_iq_range | Interquartile range (Q3 - Q1) |
| gravity | ft_gravity_torch | Distance between majority and minority class centers |
| kurtosis | ft_kurtosis | Fourth moment about the mean (tailedness) |
| skewness | ft_skewness | Third moment about the mean (asymmetry) |
| mad | ft_mad | Median Absolute Deviation |
| max | ft_max | Maximum values along dimension 0 |
| min | ft_min | Minimum values along dimension 0 |
| mean | ft_mean | Mean values along dimension 0 |
| median | ft_median | Median values along dimension 0 |
| range | ft_range | Range (max - min) along dimension 0 |
| sd | ft_std | Standard deviation along dimension 0 |
| var | ft_var | Variance along dimension 0 |
| sparsity | ft_sparsity | Feature sparsity (diversity of unique values) |

Usage

The class can be used to extract meta-features from datasets for GAN training with Meta-Feature Statistics preservation. Common subsets include (see the sketch after this list):

  • Basic statistics: ['mean', 'var', 'sd']
  • Distribution properties: ['skewness', 'kurtosis', 'mad']
  • Relationships: ['cor', 'cov', 'eigenvalues']
  • Range measures: ['min', 'max', 'range', 'iq_range']
  • Classification features: ['gravity'] (requires target variable)
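
A usage sketch for these subsets, with random data standing in for a real dataset:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

extractor = MFEToTorch()
X = torch.randn(256, 5)
y = torch.randint(0, 2, (256,))

# Distribution properties only:
dist_mfs = extractor.get_mfs(X, y, subset=["skewness", "kurtosis", "mad"])

# 'gravity' is the one feature that needs the target tensor y:
grav_mfs = extractor.get_mfs(X, y, subset=["gravity", "mean"])
print(dist_mfs.shape, grav_mfs.shape)  # torch.Size([3, 5]) torch.Size([2, 5])
```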

Attributes:

| Name | Type | Description |
|------|------|-------------|
| device | torch.device | Device for computation (default: 'cpu') |

Source code in wgan_gp/pymfe_to_torch.py
class MFEToTorch:
    """
    A class to compute meta-features using PyTorch.

    This class provides methods to calculate various meta-features for a given
    dataset using PyTorch tensors. It includes functionalities for computing
    statistical measures, correlation, covariance, and other properties of the
    data.

    Meta-Feature Statistics (MFS) Available:

    | Feature Name | Method | Description |
    |--------------|--------|-------------|
    | `cor` | `ft_cor_torch` | Correlation matrix (absolute values of lower triangle) |
    | `cov` | `ft_cov_torch` | Covariance matrix (absolute values of lower triangle) |
    | `eigenvalues` | `ft_eigenvals` | Eigenvalues of the covariance matrix |
    | `iq_range` | `ft_iq_range` | Interquartile range (Q3 - Q1) |
    | `gravity` | `ft_gravity_torch` | Distance between majority and minority class centers |
    | `kurtosis` | `ft_kurtosis` | Fourth moment about the mean (tailedness) |
    | `skewness` | `ft_skewness` | Third moment about the mean (asymmetry) |
    | `mad` | `ft_mad` | Median Absolute Deviation |
    | `max` | `ft_max` | Maximum values along dimension 0 |
    | `min` | `ft_min` | Minimum values along dimension 0 |
    | `mean` | `ft_mean` | Mean values along dimension 0 |
    | `median` | `ft_median` | Median values along dimension 0 |
    | `range` | `ft_range` | Range (max - min) along dimension 0 |
    | `sd` | `ft_std` | Standard deviation along dimension 0 |
    | `var` | `ft_var` | Variance along dimension 0 |
    | `sparsity` | `ft_sparsity` | Feature sparsity (diversity of unique values) |

    Usage:
        The class can be used to extract meta-features from datasets for GAN training
        with Meta-Feature Statistics preservation. Common subsets include:

        - Basic statistics: `['mean', 'var', 'sd']`
        - Distribution properties: `['skewness', 'kurtosis', 'mad']`
        - Relationships: `['cor', 'cov', 'eigenvalues']`
        - Range measures: `['min', 'max', 'range', 'iq_range']`
        - Classification features: `['gravity']` (requires target variable)

    Attributes:
        device (torch.device): Device for computation (default: 'cpu')
    """

    device = torch.device("cpu")

    @property
    def feature_methods(self):
        """
        Returns a dictionary that maps feature names to their corresponding extraction methods.

        This mapping is essential for calculating a comprehensive set of statistical
        properties on both real and synthetic datasets. These features are then
        used to evaluate the quality and utility of the generated synthetic data
        by comparing them against the features of the real data.

        Returns:
            dict: A dictionary where keys are feature names (strings) and
                values are the corresponding feature extraction methods.
                See the class docstring for a complete table of available features.
        """
        return {
            "cor": self.ft_cor_torch,
            "cov": self.ft_cov_torch,
            "eigenvalues": self.ft_eigenvals,
            "iq_range": self.ft_iq_range,
            "gravity": self.ft_gravity_torch,
            "kurtosis": self.ft_kurtosis,
            "skewness": self.ft_skewness,
            "mad": self.ft_mad,
            "max": self.ft_max,
            "min": self.ft_min,
            "mean": self.ft_mean,
            "median": self.ft_median,
            "range": self.ft_range,
            "sd": self.ft_std,
            "var": self.ft_var,
            "sparsity": self.ft_sparsity,
        }

    @staticmethod
    def ft_gravity_torch(
        N: torch.Tensor,
        y: torch.Tensor,
        norm_ord: Union[int, float] = 2,
        classes: Optional[torch.Tensor] = None,
        class_freqs: Optional[torch.Tensor] = None,
        cls_inds: Optional[torch.Tensor] = None,
    ):
        """
        Computes the gravity between the majority and minority classes.

        This method calculates the distance between the mean feature vectors of the
        majority and minority classes. This distance serves as a measure of class
        separation in the feature space. By computing this "gravity," the method
        quantifies the dissimilarity between the most and least frequent classes,
        providing insight into the dataset's class distribution and feature
        representation. This information can be valuable for assessing the quality
        and representativeness of generated synthetic data compared to real data.

        Args:
            N: Feature tensor of shape (num_instances, num_features).
            y: Target tensor of shape (num_instances,).
            norm_ord: Order of the norm to compute the distance (e.g., 2 for Euclidean). Defaults to 2.
            classes: Optional tensor of unique class labels. If None, it's computed from `y`.
            class_freqs: Optional tensor of class frequencies. If None, it's computed from `y`.
            cls_inds: Optional list of indices for each class. If provided, it uses these indices to select instances.

        Returns:
            torch.Tensor: The gravity value, representing the distance between the class centers.
        """
        if classes is None or class_freqs is None:
            classes, class_freqs = torch.unique(y, return_counts=True)

        ind_cls_maj = torch.argmax(class_freqs)
        class_maj = classes[ind_cls_maj]

        remaining_classes = torch.cat(
            (classes[:ind_cls_maj], classes[ind_cls_maj + 1 :])
        )
        remaining_freqs = torch.cat(
            (class_freqs[:ind_cls_maj], class_freqs[ind_cls_maj + 1 :])
        )

        ind_cls_min = torch.argmin(remaining_freqs)

        if cls_inds is not None:
            insts_cls_maj = N[cls_inds[ind_cls_maj]]
            if ind_cls_min >= ind_cls_maj:
                ind_cls_min += 1
            insts_cls_min = N[cls_inds[ind_cls_min]]
        else:
            class_min = remaining_classes[ind_cls_min]
            insts_cls_maj = N[y == class_maj]
            insts_cls_min = N[y == class_min]

        center_maj = insts_cls_maj.mean(dim=0)
        center_min = insts_cls_min.mean(dim=0)
        gravity = torch.norm(center_maj - center_min, p=norm_ord)

        return gravity

    def change_device(self, device):
        """
        Changes the device where computations will be performed.

        Args:
            device (str): The target device (e.g., 'cpu', 'cuda').

        This method is crucial for ensuring that the model and data reside on the same device,
        allowing for efficient computation and utilization of available hardware resources
        during the synthetic data generation and evaluation processes.
        """
        self.device = device

    @staticmethod
    def cov(tensor, rowvar=True, bias=False):
        """
        Estimates the covariance matrix of a given tensor, crucial for understanding the statistical relationships within the data. This is a key step in evaluating how well the generated synthetic data captures the underlying dependencies present in the original data.

        Args:
            tensor (torch.Tensor): Input data tensor.
            rowvar (bool, optional): If True (default), rows represent variables, with observations in the columns. If False, columns represent variables.
            bias (bool, optional): If False (default), then the normalization is by N-1. Otherwise, normalization is by N.

        Returns:
            torch.Tensor: The covariance matrix of the input tensor.
        """
        tensor = tensor if rowvar else tensor.transpose(-1, -2)
        tensor = tensor - tensor.mean(dim=-1, keepdim=True)
        factor = 1 / (tensor.shape[-1] - int(not bool(bias)))
        return factor * tensor @ tensor.transpose(-1, -2).conj()

    def corrcoef(self, tensor, rowvar=True):
        """
        Calculates the Pearson product-moment correlation coefficients, normalizing the covariance matrix by the standard deviations to obtain correlation values. This provides a measure of the linear relationship between variables in the input tensor, which is useful for comparing real and synthetic data.

        Args:
            tensor (torch.Tensor): Input data tensor.
            rowvar (bool, optional): If True (default), rows represent variables, with observations in the columns. Otherwise, columns represent variables.

        Returns:
            torch.Tensor: Pearson product-moment correlation coefficients matrix.
        """
        covariance = self.cov(tensor, rowvar=rowvar)
        variance = covariance.diagonal(0, -1, -2)
        if variance.is_complex():
            variance = variance.real
        stddev = variance.sqrt()
        covariance /= stddev.unsqueeze(-1)
        covariance /= stddev.unsqueeze(-2)
        if covariance.is_complex():
            covariance.real.clip_(-1, 1)
            covariance.imag.clip_(-1, 1)
        else:
            covariance.clip_(-1, 1)
        return covariance

    def ft_cor_torch(self, N: torch.Tensor) -> torch.Tensor:
        """
        Calculates the absolute values of the lower triangle elements of a correlation matrix to quantify feature dependencies.

        This method computes the correlation matrix of the input tensor `N`,
        extracts the elements from the lower triangle (excluding the diagonal),
        and returns the absolute values of these elements. This is done to summarize the relationships between features,
        which is useful for evaluating how well the synthetic data captures the dependencies present in the real data.
        By focusing on the lower triangle and taking absolute values, the method efficiently provides a measure of feature interconnectedness,
        ignoring self-correlations and directionality.

        Args:
            N: The input tensor for which to compute the correlation matrix.

        Returns:
            torch.Tensor: A tensor containing the absolute values of the elements
                in the lower triangle of the correlation matrix.
        """
        corr_mat = self.corrcoef(N, rowvar=False)
        res_num_rows, _ = corr_mat.shape

        tril_indices = torch.tril_indices(res_num_rows, res_num_rows, offset=-1)
        inf_triang_vals = corr_mat[tril_indices[0], tril_indices[1]]

        return torch.abs(inf_triang_vals)

    def ft_cov_torch(
        self,
        N: torch.Tensor,
    ) -> torch.Tensor:
        """
        Calculates the absolute values of the lower triangular elements of the covariance matrix. This focuses on the relationships between variables, extracting the lower triangle to reduce redundancy and focusing on key covariance values. The absolute value ensures that the magnitude of the covariance is considered, regardless of the direction of the relationship.

        Args:
            N: Input tensor for covariance calculation.

        Returns:
            torch.Tensor: A tensor containing the absolute values of the lower triangular elements of the covariance matrix.
        """
        cov_mat = self.cov(N, rowvar=False)

        res_num_rows = cov_mat.shape[0]
        tril_indices = torch.tril_indices(res_num_rows, res_num_rows, offset=-1)
        inf_triang_vals = cov_mat[tril_indices[0], tril_indices[1]]

        return torch.abs(inf_triang_vals)

    def ft_eigenvals(self, x: torch.Tensor) -> torch.Tensor:
        """
        Computes the eigenvalues of the covariance matrix of the input tensor.

        This function is crucial for assessing the diversity and information
        content of the input data. By calculating the eigenvalues of the
        covariance matrix, we gain insights into the principal components
        and variance distribution within the data, which helps to ensure
        the generated synthetic data retains the key statistical
        characteristics of the original data.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The eigenvalues of the covariance matrix.
        """
        # eigvalsh returns all eigenvalues of the symmetric covariance matrix;
        # centering here is redundant (self.cov centers again) but harmless
        centered = x - x.mean(dim=0, keepdim=True)
        covs = self.cov(centered, rowvar=False)
        return torch.linalg.eigvalsh(covs)

    @staticmethod
    def ft_iq_range(X: torch.Tensor) -> torch.Tensor:
        """
        Calculates the interquartile range (IQR) of a tensor along the first dimension.

        The IQR is a measure of statistical dispersion, representing the difference between the 75th and 25th percentiles. This is useful for understanding the spread of the data, which helps to assess the utility of generated synthetic data by comparing its distribution to the real data.

        Args:
            X: The input tensor of shape [num_samples, num_features].

        Returns:
            The interquartile range of the input tensor, with shape [num_features]. This represents the spread of each feature across the samples.
        """
        q75, q25 = torch.quantile(X, 0.75, dim=0), torch.quantile(X, 0.25, dim=0)
        iqr = q75 - q25  # shape: [num_features]
        return iqr

    @staticmethod
    def ft_kurtosis(x: torch.Tensor) -> torch.Tensor:
        """
        Calculates the kurtosis of a tensor.

        This function computes the kurtosis of the input tensor `x`, a statistical measure
        describing the shape of the data's distribution, specifically its tailedness.
        By calculating kurtosis, we can assess how well the generated data's distribution
        matches that of the real data, ensuring the synthetic data retains similar statistical
        properties. This is crucial for maintaining the utility of the generated data in downstream tasks.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: The excess kurtosis of each feature, computed along dimension 0.
        """
        # Per-feature moments along dimension 0 (pymfe reports kurtosis per attribute)
        mean = torch.mean(x, dim=0, keepdim=True)
        diffs = x - mean
        var = torch.mean(torch.pow(diffs, 2.0), dim=0)
        std = torch.pow(var, 0.5)
        zscores = diffs / std
        return torch.mean(torch.pow(zscores, 4.0), dim=0) - 3.0

    @staticmethod
    def ft_skewness(x: torch.Tensor) -> torch.Tensor:
        """
        Computes the skewness of a tensor.

        This function calculates the skewness of the input tensor, a key statistical
        measure reflecting the asymmetry of the data distribution. Preserving this characteristic
        is crucial when generating synthetic data to maintain the real data's statistical properties.

        Args:
            x (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The skewness of each feature, computed along dimension 0.
        """
        # Per-feature moments along dimension 0 (pymfe reports skewness per attribute)
        mean = torch.mean(x, dim=0, keepdim=True)
        diffs = x - mean
        var = torch.mean(torch.pow(diffs, 2.0), dim=0)
        std = torch.pow(var, 0.5)
        zscores = diffs / std
        return torch.mean(torch.pow(zscores, 3.0), dim=0)

    @staticmethod
    def ft_mad(x: torch.Tensor, factor: float = 1.4826) -> torch.Tensor:
        """
        Compute the Median Absolute Deviation (MAD) of a tensor.

        The MAD is a robust measure of statistical dispersion, useful for
        understanding the spread of data in both real and synthetic datasets.
        It helps assess how well the generated data captures the variability
        present in the original data.

        Args:
            x: The input tensor.
            factor: A scaling factor to make the MAD an unbiased estimator of the
                standard deviation for normal data. Default is 1.4826, which
                applies when the data is normally distributed.

        Returns:
            torch.Tensor: The MAD of the input tensor.
        """
        m = x.median(dim=0, keepdim=True).values
        ama = torch.abs(x - m)
        mama = ama.median(dim=0).values
        return mama * factor  # scale the raw MAD by the consistency factor

    @staticmethod
    def ft_mean(N: torch.Tensor) -> torch.Tensor:
        """
        Computes the mean of a tensor along the first dimension to aggregate information across samples. This is useful for summarizing the central tendency of features in the generated or real data.

        Args:
            N (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The mean of the input tensor along dimension 0.
        """
        return N.mean(dim=0)

    @staticmethod
    def ft_max(N: torch.Tensor) -> torch.Tensor:
        """
        Finds the maximum value in a tensor along dimension 0. This is used to identify the most prominent features across a dataset, which is crucial for maintaining data utility in generated synthetic data.

        Args:
            N (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: A tensor containing the maximum values along dimension 0.
        """
        return N.max(dim=0, keepdim=False).values

    @staticmethod
    def ft_median(N: torch.Tensor) -> torch.Tensor:
        """
        Calculates the median of a tensor along the first dimension. This is used to derive a representative central tendency of the data distribution, which is a crucial aspect of maintaining data utility in synthetic data generation.

        Args:
            N: The input tensor.

        Returns:
            torch.Tensor: A tensor containing the median values along the first dimension.
        """
        return N.median(dim=0).values

    @staticmethod
    def ft_min(N: torch.Tensor) -> torch.Tensor:
        """
        Finds the minimum value of a tensor along dimension 0, which is useful for identifying the smallest values across different samples when comparing real and synthetic data distributions.

        Args:
            N (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: A tensor containing the minimum values along dimension 0. This represents the minimum feature values across the dataset, aiding in the comparison of feature ranges between real and synthetic datasets.
        """
        return N.min(dim=0).values

    @staticmethod
    def ft_var(N):
        """
        Calculates the variance of a tensor along dimension 0. This is a crucial step in assessing the statistical similarity between real and synthetic datasets generated by the GAN, ensuring that the generated data captures the variability present in the original data.

        Args:
            N (torch.Tensor): The input tensor.

        Returns:
            torch.Tensor: The variance of the input tensor along dimension 0.
        """
        return torch.var(N, dim=0)

    @staticmethod
    def ft_std(N):
        """
        Calculates the standard deviation of a tensor along the first dimension (dimension 0). This is used to understand the spread or dispersion of the generated synthetic data across different samples, ensuring the generated data maintains a similar statistical distribution to the real data.

        Args:
            N (torch.Tensor): The input tensor representing a batch of generated samples.

        Returns:
            torch.Tensor: The standard deviation of the input tensor along dimension 0, representing the standard deviation for each feature across the generated samples.
        """
        return torch.std(N, dim=0)

    @staticmethod
    def ft_range(N: torch.Tensor) -> torch.Tensor:
        """
        Calculates the range of values (max - min) along the first dimension (dimension 0) of the input tensor. This is useful for understanding the spread or variability of the data along that dimension, which helps assess how well the generated data captures the characteristics of the original data.

        Args:
            N: The input tensor.

        Returns:
            torch.Tensor: A tensor containing the range (max - min) of values along dimension 0.
        """
        return N.max(dim=0).values - N.min(dim=0).values

    def ft_sparsity(self, N: torch.Tensor) -> torch.Tensor:
        """
        Calculates the feature sparsity of a given tensor.

        This method computes the sparsity of each feature in the input tensor `N`.
        Sparsity is defined as the ratio of the total number of instances to the
        number of unique values for each feature, normalized to the range [0, 1].
        This metric helps to assess the diversity of feature values, which is crucial
        for generating synthetic data that accurately reflects the statistical
        properties of the original dataset. By quantifying feature sparsity, we can
        ensure that the generated data maintains a similar level of variability
        as the real data, thereby preserving its utility for downstream tasks.

        Args:
            N (torch.Tensor): A tensor of shape (num_instances, num_features) representing the input data.

        Returns:
            torch.Tensor: A tensor of shape (num_features,) containing the sparsity
            score for each feature, normalized to the range [0, 1]. The tensor is
            moved to the device specified by `self.device`.
        """
        ans = torch.tensor([attr.size(0) / torch.unique(attr).size(0) for attr in N.T])

        num_inst = N.size(0)
        norm_factor = 1.0 / (num_inst - 1.0)
        result = (ans - 1.0) * norm_factor

        return result.to(self.device)

    def pad_only(self, tensor, target_len):
        """
        Pads a tensor with zeros to a specified length, ensuring consistent input sizes for subsequent processing steps. This is particularly useful when dealing with variable-length sequences that need to be batched or processed by models requiring fixed-size inputs.

        Args:
            tensor (torch.Tensor): The input tensor to be padded.
            target_len (int): The desired length of the padded tensor.

        Returns:
            torch.Tensor: The padded tensor, or the original tensor if its length is already greater than or equal to `target_len`.
        """
        if tensor.shape[0] < target_len:
            padding = torch.zeros(target_len - tensor.shape[0]).to(self.device)
            return torch.cat([tensor, padding])

        return tensor

    def get_mfs(self, X, y, subset=None):
        """
        Computes a set of meta-features on the input data. These meta-features capture essential characteristics of the dataset, which is crucial for evaluating and ensuring the utility of synthetic data generated by GANs.

        Args:
            X (torch.Tensor): The input data tensor.
            y (torch.Tensor, optional): The target variable tensor. Required if 'gravity' is in the subset.
            subset (list of str, optional): A list of meta-feature names to compute. If None, defaults to ['mean', 'var'].

        Returns:
            torch.Tensor: A tensor containing the computed meta-features, padded to the maximum shape among the computed features and stacked into a single tensor. This allows for consistent representation and comparison of different meta-features.
        """
        if subset is None:
            subset = ["mean", "var"]

        mfs = []
        for name in subset:
            if name not in self.feature_methods:
                raise ValueError(f"Unsupported meta-feature: '{name}'")

            if name == "gravity":
                if y is None:
                    raise ValueError("Meta-feature 'gravity' requires `y`.")
                res = self.feature_methods[name](X, y)
                res = torch.tile(res, (X.shape[-1],))  # match dimensionality
            else:
                res = self.feature_methods[name](X)

            mfs.append(res)
        shapes = [i.shape.numel() for i in mfs]
        mfs = [self.pad_only(mf, max(shapes)) for mf in mfs]
        return torch.stack(mfs)

    def test_me(self, subset=None):
        """
        Compares meta-feature extraction using the `pymfe` package and the `MFEToTorch` class.

        This method fetches the California Housing dataset, extracts meta-features using both `pymfe` and the `MFEToTorch` class, and then compares the results. This comparison helps validate the correctness and consistency of the meta-feature extraction process implemented in the `MFEToTorch` class, ensuring that it aligns with established meta-feature extraction tools.

        Args:
            subset (list, optional): A list of meta-features to extract. If None, defaults to ["mean", "var"].

        Returns:
            pandas.DataFrame: A DataFrame containing the meta-features extracted by both `pymfe` and `MFEToTorch`, along with any discrepancies between the two.
        """
        if subset is None:
            subset = ["mean", "var"]

        from sklearn.datasets import fetch_california_housing

        bunch = fetch_california_housing(as_frame=True)
        X, y = bunch.data, bunch.target
        print(f"Init data shape: {X.shape} + {y.shape}")

        mfe = MFE(groups="statistical", summary=None)
        mfe.fit(X.values, y.values)
        ft = mfe.extract()

        pymfe = pd.DataFrame(
            map(lambda x: [x], ft[1]), index=ft[0], columns=["pymfe"]
        ).dropna()

        X_tensor = torch.tensor(X.values)
        y_tensor = torch.tensor(y.values)

        mfs = self.get_mfs(X_tensor, y_tensor, subset).numpy()
        mfs_df = pd.DataFrame({"torch_mfs": list(mfs)})

        mfs_df.index = subset
        # mfs_df = mfs_df.reindex(self.mfs_available)

        res = pymfe.merge(mfs_df, left_index=True, right_index=True, how="outer")

        def round_element(val, decimals=2):
            if isinstance(val, list):
                return [round(x, decimals) for x in val]
            elif isinstance(val, np.ndarray):
                return np.round(val, decimals)
            return round(val, decimals)

        res = res.map(lambda x: round_element(x, 5)).dropna()

        print(res)
        return res

feature_methods property

Returns a dictionary that maps feature names to their corresponding extraction methods.

This mapping is essential for calculating a comprehensive set of statistical properties on both real and synthetic datasets. These features are then used to evaluate the quality and utility of the generated synthetic data by comparing them against the features of the real data.

Returns:

| Type | Description |
|------|-------------|
| dict | A dictionary where keys are feature names (strings) and values are the corresponding feature extraction methods. See the class docstring for a complete table of available features. |
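
A short example of dispatching through this mapping, using the feature names listed in the class docstring:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

extractor = MFEToTorch()
X = torch.randn(100, 4)

print(sorted(extractor.feature_methods))        # all supported feature names
iqr = extractor.feature_methods["iq_range"](X)  # dispatch a feature by name
print(iqr.shape)                                # torch.Size([4])
```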

change_device(device)

Changes the device where computations will be performed.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| device | str | The target device (e.g., 'cpu', 'cuda'). | required |

This method is crucial for ensuring that the model and data reside on the same device, allowing for efficient computation and utilization of available hardware resources during the synthetic data generation and evaluation processes.


corrcoef(tensor, rowvar=True)

Calculates the Pearson product-moment correlation coefficients, normalizing the covariance matrix by the standard deviations to obtain correlation values. This provides a measure of the linear relationship between variables in the input tensor, which is useful for comparing real and synthetic data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| tensor | Tensor | Input data tensor. | required |
| rowvar | bool | If True (default), rows represent variables, with observations in the columns. Otherwise, columns represent variables. | True |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Pearson product-moment correlation coefficients matrix. |
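
As a sanity check, the result should agree with torch.corrcoef (available in recent PyTorch releases), which expects variables along rows:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

extractor = MFEToTorch()
X = torch.randn(500, 3, dtype=torch.float64)

ours = extractor.corrcoef(X, rowvar=False)
ref = torch.corrcoef(X.T)  # torch.corrcoef takes variables along rows
print(torch.allclose(ours, ref, atol=1e-6))  # expected: True
```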


cov(tensor, rowvar=True, bias=False) staticmethod

Estimates the covariance matrix of a given tensor, crucial for understanding the statistical relationships within the data. This is a key step in evaluating how well the generated synthetic data captures the underlying dependencies present in the original data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| tensor | Tensor | Input data tensor. | required |
| rowvar | bool | If True (default), rows represent variables, with observations in the columns. If False, columns represent variables. | True |
| bias | bool | If False (default), normalization is by N-1. Otherwise, normalization is by N. | False |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The covariance matrix of the input tensor. |

ft_cor_torch(N)

Calculates the absolute values of the lower triangle elements of a correlation matrix to quantify feature dependencies.

This method computes the correlation matrix of the input tensor N, extracts the elements from the lower triangle (excluding the diagonal), and returns the absolute values of these elements. This is done to summarize the relationships between features, which is useful for evaluating how well the synthetic data captures the dependencies present in the real data. By focusing on the lower triangle and taking absolute values, the method efficiently provides a measure of feature interconnectedness, ignoring self-correlations and directionality.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | The input tensor for which to compute the correlation matrix. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | A tensor containing the absolute values of the elements in the lower triangle of the correlation matrix. |
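
For d features the lower triangle (excluding the diagonal) holds d·(d-1)/2 pairwise correlations, all mapped into [0, 1]:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

extractor = MFEToTorch()
X = torch.randn(200, 6)

vals = extractor.ft_cor_torch(X)
print(vals.shape)  # torch.Size([15]) == 6 * 5 / 2 feature pairs
assert torch.all(vals >= 0) and torch.all(vals <= 1)
```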


ft_cov_torch(N)

Calculates the absolute values of the lower triangular elements of the covariance matrix. This focuses on the relationships between variables, extracting the lower triangle to reduce redundancy and focusing on key covariance values. The absolute value ensures that the magnitude of the covariance is considered, regardless of the direction of the relationship.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | Input tensor for covariance calculation. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | A tensor containing the absolute values of the lower triangular elements of the covariance matrix. |


ft_eigenvals(x)

Computes the eigenvalues of the covariance matrix of the input tensor.

This function is crucial for assessing the diversity and information content of the input data. By calculating the eigenvalues of the covariance matrix, we gain insights into the principal components and variance distribution within the data, which helps to ensure the generated synthetic data retains the key statistical characteristics of the original data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| x | Tensor | The input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The eigenvalues of the covariance matrix. |
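
A useful invariant for testing: the eigenvalues of the covariance matrix sum to the total (unbiased) variance, since both equal the trace of the covariance matrix:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

extractor = MFEToTorch()
X = torch.randn(300, 4, dtype=torch.float64)

eigvals = extractor.ft_eigenvals(X)
total_var = X.var(dim=0).sum()  # unbiased by default
print(torch.allclose(eigvals.sum(), total_var))  # expected: True
```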


ft_gravity_torch(N, y, norm_ord=2, classes=None, class_freqs=None, cls_inds=None) staticmethod

Computes the gravity between the majority and minority classes.

This method calculates the distance between the mean feature vectors of the majority and minority classes. This distance serves as a measure of class separation in the feature space. By computing this "gravity," the method quantifies the dissimilarity between the most and least frequent classes, providing insight into the dataset's class distribution and feature representation. This information can be valuable for assessing the quality and representativeness of generated synthetic data compared to real data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | Feature tensor of shape (num_instances, num_features). | required |
| y | Tensor | Target tensor of shape (num_instances,). | required |
| norm_ord | Union[int, float] | Order of the norm to compute the distance (e.g., 2 for Euclidean). | 2 |
| classes | Optional[Tensor] | Optional tensor of unique class labels. If None, it's computed from y. | None |
| class_freqs | Optional[Tensor] | Optional tensor of class frequencies. If None, it's computed from y. | None |
| cls_inds | Optional[Tensor] | Optional list of indices for each class. If provided, it uses these indices to select instances. | None |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The gravity value, representing the distance between the class centers. |
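
A toy example with two well-separated classes (class 0 is the majority), where the expected distance is easy to verify by hand:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

# 80 majority points at the origin, 20 minority points at (3, 3).
N = torch.cat([torch.zeros(80, 2), torch.full((20, 2), 3.0)])
y = torch.cat([torch.zeros(80), torch.ones(20)])

g = MFEToTorch.ft_gravity_torch(N, y)
print(g)  # ||(0, 0) - (3, 3)||_2 = 3 * sqrt(2) ≈ 4.2426
```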


ft_iq_range(X) staticmethod

Calculates the interquartile range (IQR) of a tensor along the first dimension.

The IQR is a measure of statistical dispersion, representing the difference between the 75th and 25th percentiles. This is useful for understanding the spread of the data, which helps to assess the utility of generated synthetic data by comparing its distribution to the real data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | Tensor | The input tensor of shape [num_samples, num_features]. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The interquartile range of the input tensor, with shape [num_features]. This represents the spread of each feature across the samples. |


ft_kurtosis(x) staticmethod

Calculates the kurtosis of a tensor.

This function computes the kurtosis of the input tensor x, a statistical measure describing the shape of the data's distribution, specifically its tailedness. By calculating kurtosis, we can assess how well the generated data's distribution matches that of the real data, ensuring the synthetic data retains similar statistical properties. This is crucial for maintaining the utility of the generated data in downstream tasks.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| x | Tensor | Input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The excess kurtosis of each feature, computed along dimension 0. |
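
The method reports excess kurtosis (the Gaussian baseline of 3 is subtracted), so values near zero are expected for normally distributed features; a quick check on random data:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

torch.manual_seed(0)
X = torch.randn(100_000, 3)

print(MFEToTorch.ft_kurtosis(X))  # per-feature excess kurtosis, all near 0.0
```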


ft_mad(x, factor=1.4826) staticmethod

Compute the Median Absolute Deviation (MAD) of a tensor.

The MAD is a robust measure of statistical dispersion, useful for understanding the spread of data in both real and synthetic datasets. It helps assess how well the generated data captures the variability present in the original data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| x | Tensor | The input tensor. | required |
| factor | float | A scaling factor to make the MAD an unbiased estimator of the standard deviation for normal data. Default is 1.4826, which applies when the data is normally distributed. | 1.4826 |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The MAD of the input tensor. |
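
For Gaussian data, the default factor of 1.4826 makes the scaled MAD track the standard deviation, which gives a quick consistency check:

```python
import torch

from wgan_gp.pymfe_to_torch import MFEToTorch

torch.manual_seed(0)
X = torch.randn(100_000, 2)

mad = MFEToTorch.ft_mad(X)  # scaled by 1.4826 by default
std = X.std(dim=0)
print(mad, std)  # both close to 1.0 per feature
```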


ft_max(N) staticmethod

Finds the maximum value in a tensor along dimension 0. This is used to identify the most prominent features across a dataset, which is crucial for maintaining data utility in generated synthetic data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | The input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | A tensor containing the maximum values along dimension 0. |


ft_mean(N) staticmethod

Computes the mean of a tensor along the first dimension to aggregate information across samples. This is useful for summarizing the central tendency of features in the generated or real data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | The input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | The mean of the input tensor along dimension 0. |


ft_median(N) staticmethod

Calculates the median of a tensor along the first dimension. This is used to derive a representative central tendency of the data distribution, which is a crucial aspect of maintaining data utility in synthetic data generation.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | The input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | A tensor containing the median values along the first dimension. |


ft_min(N) staticmethod

Finds the minimum value of a tensor along dimension 0, which is useful for identifying the smallest values across different samples when comparing real and synthetic data distributions.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | The input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | A tensor containing the minimum values along dimension 0. This represents the minimum feature values across the dataset, aiding in the comparison of feature ranges between real and synthetic datasets. |


ft_range(N) staticmethod

Calculates the range of values (max - min) along the first dimension (dimension 0) of the input tensor. This is useful for understanding the spread or variability of the data along that dimension, which helps assess how well the generated data captures the characteristics of the original data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| N | Tensor | The input tensor. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | A tensor containing the range (max - min) of values along dimension 0. |


ft_skewness(x) staticmethod

Computes the skewness of a tensor.

This function calculates the skewness of the input tensor, a key statistical measure reflecting the asymmetry of the data distribution. Preserving this characteristic is crucial when generating synthetic data to maintain the real data's statistical properties.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| x | Tensor | The input tensor. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Tensor | The per-feature skewness of the input tensor, computed along dimension 0. |

Source code in wgan_gp/pymfe_to_torch.py
@staticmethod
def ft_skewness(x: torch.Tensor) -> torch.Tensor:
    """
    Computes the skewness of a tensor.

    This function calculates the skewness of the input tensor, a key statistical
    measure reflecting the asymmetry of the data distribution. Preserving this characteristic
    is crucial when generating synthetic data to maintain the real data's statistical properties.

    Args:
        x (torch.Tensor): The input tensor of shape (num_instances, num_features).

    Returns:
        torch.Tensor: The per-feature skewness of the input tensor, computed along dimension 0.
    """
    # Biased (population) estimator m3 / m2^(3/2), computed per feature along
    # dim 0 so the output lines up with the other per-feature ft_* methods.
    mean = torch.mean(x, dim=0)
    diffs = x - mean
    var = torch.mean(torch.pow(diffs, 2.0), dim=0)
    std = torch.pow(var, 0.5)
    zscores = diffs / std
    skews = torch.mean(torch.pow(zscores, 3.0), dim=0)
    return skews
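
As a quick sanity check (a sketch, not part of the module), this biased estimator should agree with `scipy.stats.skew`, whose default `bias=True` uses the same m3 / m2^(3/2) formula:

    import numpy as np
    import scipy.stats
    import torch

    X = torch.randn(1000, 4, dtype=torch.float64)
    print(np.allclose(
        MFEToTorch.ft_skewness(X).numpy(),
        scipy.stats.skew(X.numpy(), axis=0, bias=True),
    ))  # expected: True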

ft_sparsity(N)

Calculates the feature sparsity of a given tensor.

This method computes the sparsity of each feature in the input tensor N. Sparsity is defined as the ratio of the total number of instances to the number of unique values for each feature, normalized to the range [0, 1]. This metric helps to assess the diversity of feature values, which is crucial for generating synthetic data that accurately reflects the statistical properties of the original dataset. By quantifying feature sparsity, we can ensure that the generated data maintains a similar level of variability as the real data, thereby preserving its utility for downstream tasks.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| N | Tensor | A tensor of shape (num_instances, num_features) representing the input data. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Tensor | A tensor of shape (num_features,) containing the sparsity score for each feature, normalized to the range [0, 1]. The tensor is moved to the device specified by self.device. |

Source code in wgan_gp/pymfe_to_torch.py
def ft_sparsity(self, N: torch.Tensor) -> torch.Tensor:
    """
    Calculates the feature sparsity of a given tensor.

    This method computes the sparsity of each feature in the input tensor `N`.
    Sparsity is defined as the ratio of the total number of instances to the
    number of unique values for each feature, normalized to the range [0, 1].
    This metric helps to assess the diversity of feature values, which is crucial
    for generating synthetic data that accurately reflects the statistical
    properties of the original dataset. By quantifying feature sparsity, we can
    ensure that the generated data maintains a similar level of variability
    as the real data, thereby preserving its utility for downstream tasks.

    Args:
        N (torch.Tensor): A tensor of shape (num_instances, num_features) representing the input data.

    Returns:
        torch.Tensor: A tensor of shape (num_features,) containing the sparsity
        score for each feature, normalized to the range [0, 1]. The tensor is
        moved to the device specified by `self.device`.
    """
    ans = torch.tensor([attr.size(0) / torch.unique(attr).size(0) for attr in N.T])

    num_inst = N.size(0)
    norm_factor = 1.0 / (num_inst - 1.0)
    result = (ans - 1.0) * norm_factor

    return result.to(self.device)
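
For intuition, a toy sketch (assuming a default-constructed instance on CPU): a feature whose values are all unique gives n/n = 1, which normalizes to 0, while a constant feature gives n/1 = n, which normalizes to (n - 1)/(n - 1) = 1:

    import torch

    N = torch.tensor([[1.0, 5.0],
                      [2.0, 5.0],
                      [3.0, 5.0],
                      [4.0, 5.0]])
    print(MFEToTorch().ft_sparsity(N))  # expected: tensor([0., 1.])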

ft_std(N) staticmethod

Calculates the standard deviation of a tensor along the first dimension (dimension 0). This is used to understand the spread or dispersion of the generated synthetic data across different samples, ensuring the generated data maintains a similar statistical distribution to the real data.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| N | Tensor | The input tensor representing a batch of generated samples. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Tensor | The standard deviation of the input tensor along dimension 0, representing the standard deviation for each feature across the generated samples. |

Source code in wgan_gp/pymfe_to_torch.py
@staticmethod
def ft_std(N: torch.Tensor) -> torch.Tensor:
    """
    Calculates the standard deviation of a tensor along the first dimension (dimension 0). This is used to understand the spread or dispersion of the generated synthetic data across different samples, ensuring the generated data maintains a similar statistical distribution to the real data.

    Args:
        N (torch.Tensor): The input tensor representing a batch of generated samples.

    Returns:
        torch.Tensor: The standard deviation of the input tensor along dimension 0, representing the standard deviation for each feature across the generated samples.
    """
    return torch.std(N, dim=0)
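
One detail the docstring leaves implicit: `torch.std` (and `torch.var` below) default to the unbiased estimator with Bessel's correction (ddof = 1), whereas NumPy defaults to ddof = 0. A small sketch of the equivalence:

    import numpy as np
    import torch

    N = torch.randn(100, 3, dtype=torch.float64)
    print(np.allclose(torch.std(N, dim=0).numpy(),
                      N.numpy().std(axis=0, ddof=1)))  # expected: True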

ft_var(N) staticmethod

Calculates the variance of a tensor along dimension 0. This is a crucial step in assessing the statistical similarity between real and synthetic datasets generated by the GAN, ensuring that the generated data captures the variability present in the original data.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| N | Tensor | The input tensor. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Tensor | The variance of the input tensor along dimension 0. |

Source code in wgan_gp/pymfe_to_torch.py
@staticmethod
def ft_var(N: torch.Tensor) -> torch.Tensor:
    """
    Calculates the variance of a tensor along dimension 0. This is a crucial step in assessing the statistical similarity between real and synthetic datasets generated by the GAN, ensuring that the generated data captures the variability present in the original data.

    Args:
        N (torch.Tensor): The input tensor.

    Returns:
        torch.Tensor: The variance of the input tensor along dimension 0.
    """
    return torch.var(N, dim=0)

get_mfs(X, y, subset=None)

Computes a set of meta-features on the input data. These meta-features capture essential characteristics of the dataset, which is crucial for evaluating and ensuring the utility of synthetic data generated by GANs.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| X | Tensor | The input data tensor. | required |
| y | Tensor | The target variable tensor. Required if 'gravity' is in the subset. | required |
| subset | list of str | A list of meta-feature names to compute. If None, defaults to ['mean', 'var']. | None |

Returns:

| Type | Description |
| ---- | ----------- |
| Tensor | A tensor containing the computed meta-features, padded to the maximum shape among the computed features and stacked into a single tensor. This allows for consistent representation and comparison of different meta-features. |

Source code in wgan_gp/pymfe_to_torch.py
def get_mfs(self, X, y, subset=None):
    """
    Computes a set of meta-features on the input data. These meta-features capture essential characteristics of the dataset, which is crucial for evaluating and ensuring the utility of synthetic data generated by GANs.

    Args:
        X (torch.Tensor): The input data tensor.
        y (torch.Tensor or None): The target variable tensor. Required when 'gravity' is in the subset; may be None otherwise.
        subset (list of str, optional): A list of meta-feature names to compute. If None, defaults to ['mean', 'var'].

    Returns:
        torch.Tensor: A tensor containing the computed meta-features, padded to the maximum shape among the computed features and stacked into a single tensor. This allows for consistent representation and comparison of different meta-features.
    """
    if subset is None:
        subset = ["mean", "var"]

    mfs = []
    for name in subset:
        if name not in self.feature_methods:
            raise ValueError(f"Unsupported meta-feature: '{name}'")

        if name == "gravity":
            if y is None:
                raise ValueError("Meta-feature 'gravity' requires `y`.")
            res = self.feature_methods[name](X, y)
            res = torch.tile(res, (X.shape[-1],))  # match dimensionality
        else:
            res = self.feature_methods[name](X)

        mfs.append(res)
    shapes = [i.shape.numel() for i in mfs]
    mfs = [self.pad_only(mf, max(shapes)) for mf in mfs]
    return torch.stack(mfs)
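
A minimal usage sketch (assuming a default-constructed instance; `y` may be None when 'gravity' is not requested):

    import torch

    extractor = MFEToTorch()
    X = torch.randn(256, 8)
    mfs = extractor.get_mfs(X, None, subset=["mean", "var", "sd"])
    print(mfs.shape)  # expected: torch.Size([3, 8]), one row per meta-feature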

pad_only(tensor, target_len)

Pads a tensor with zeros to a specified length, ensuring consistent input sizes for subsequent processing steps. This is particularly useful when dealing with variable-length sequences that need to be batched or processed by models requiring fixed-size inputs.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| tensor | Tensor | The input tensor to be padded. | required |
| target_len | int | The desired length of the padded tensor. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Tensor | The padded tensor, or the original tensor if its length is already greater than or equal to target_len. |

Source code in wgan_gp/pymfe_to_torch.py
def pad_only(self, tensor, target_len):
    """
    Pads a tensor with zeros to a specified length, ensuring consistent input sizes for subsequent processing steps. This is particularly useful when dealing with variable-length sequences that need to be batched or processed by models requiring fixed-size inputs.

    Args:
        tensor (torch.Tensor): The input tensor to be padded.
        target_len (int): The desired length of the padded tensor.

    Returns:
        torch.Tensor: The padded tensor, or the original tensor if its length is already greater than or equal to `target_len`.
    """
    if tensor.shape[0] < target_len:
        # Create the padding with the input's dtype so torch.cat does not
        # silently promote types, and place it on the configured device.
        padding = torch.zeros(
            target_len - tensor.shape[0], dtype=tensor.dtype, device=self.device
        )
        return torch.cat([tensor, padding])

    return tensor
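
A tiny sketch of the padding behavior (assuming a default-constructed instance):

    import torch

    m = MFEToTorch()
    t = torch.tensor([1.0, 2.0, 3.0])
    print(m.pad_only(t, 5))  # expected: tensor([1., 2., 3., 0., 0.])
    print(m.pad_only(t, 2))  # target shorter than the tensor: returned unchanged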

test_me(subset=None)

Compares meta-feature extraction using the pymfe package and the MFEToTorch class.

This method fetches the California Housing dataset, extracts meta-features using both pymfe and the MFEToTorch class, and then compares the results. This comparison helps validate the correctness and consistency of the meta-feature extraction process implemented in the MFEToTorch class, ensuring that it aligns with established meta-feature extraction tools.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| subset | list | A list of meta-features to extract. If None, defaults to ["mean", "var"]. | None |

Returns:

| Type | Description |
| ---- | ----------- |
| DataFrame | A pandas DataFrame containing the meta-features extracted by both pymfe and MFEToTorch, along with any discrepancies between the two. |

Source code in wgan_gp/pymfe_to_torch.py
def test_me(self, subset=None):
    """
    Compares meta-feature extraction using the `pymfe` package and the `MFEToTorch` class.

    This method fetches the California Housing dataset, extracts meta-features using both `pymfe` and the `MFEToTorch` class, and then compares the results. This comparison helps validate the correctness and consistency of the meta-feature extraction process implemented in the `MFEToTorch` class, ensuring that it aligns with established meta-feature extraction tools.

    Args:
        subset (list, optional): A list of meta-features to extract. If None, defaults to ["mean", "var"].

    Returns:
        pandas.DataFrame: A DataFrame containing the meta-features extracted by both `pymfe` and `MFEToTorch`, along with any discrepancies between the two.
    """
    if subset is None:
        subset = ["mean", "var"]

    from sklearn.datasets import fetch_california_housing

    bunch = fetch_california_housing(as_frame=True)
    X, y = bunch.data, bunch.target
    print(f"Init data shape: {X.shape} + {y.shape}")

    mfe = MFE(groups="statistical", summary=None)
    mfe.fit(X.values, y.values)
    ft = mfe.extract()

    pymfe = pd.DataFrame(
        map(lambda x: [x], ft[1]), index=ft[0], columns=["pymfe"]
    ).dropna()

    X_tensor = torch.tensor(X.values)
    y_tensor = torch.tensor(y.values)  # build from the underlying NumPy array, as for X above

    mfs = self.get_mfs(X_tensor, y_tensor, subset).numpy()
    mfs_df = pd.DataFrame({"torch_mfs": list(mfs)})

    mfs_df.index = subset
    # mfs_df = mfs_df.reindex(self.mfs_available)

    res = pymfe.merge(mfs_df, left_index=True, right_index=True, how="outer")

    def round_element(val, decimals=2):
        if isinstance(val, list):
            return [round(x, decimals) for x in val]
        elif isinstance(val, np.ndarray):
            return np.round(val, decimals)
        return round(val, decimals)

    res = res.map(lambda x: round_element(x, 5)).dropna()

    print(res)

    return res
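
A hypothetical invocation (a sketch; `fetch_california_housing` downloads the dataset on first use, and pymfe, pandas, and numpy must be importable at module level):

    comparison = MFEToTorch().test_me(["mean", "var", "sd"])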