public class EmpiricalDistribution extends AbstractRealDistribution
Represents an empirical probability distribution -- a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.
An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support loading observed data, reporting statistics for the full sample and for each bin, and generating random values.
EmpiricalDistribution can be used to build grouped frequency histograms representing the input data or to generate random values "like" those in the input file; that is, the values generated will follow the distribution of the values in the file.
The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing. Digesting the input file divides the range of the data into binCount "bins."
EmpiricalDistribution implements the RealDistribution interface as follows. Given x within the range of values in the dataset, let B
be the bin containing x and let K be the within-bin kernel for B. Let P(B-)
be the sum of the probabilities of the bins below B and let K(B) be the
mass of B under K (i.e., the integral of the kernel density over B). Then
set P(X < x) = P(B-) + P(B) * K(x) / K(B) where K(x) is the kernel distribution
evaluated at x. This results in a cdf that matches the grouped frequency
distribution at the bin endpoints and interpolates within bins using
within-bin kernels.
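The interpolation rule above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the library's code: the class WithinBinCdfSketch, its cdf method and all parameter names are hypothetical; K(x) is read as the kernel mass between the lower endpoint of B and x (the reading under which the cdf matches the grouped frequencies at the bin endpoints); and a logistic cdf stands in for the Gaussian kernel because the JDK has no built-in normal cdf.

```java
import java.util.function.DoubleUnaryOperator;

class WithinBinCdfSketch {

    /**
     * P(X < x) = P(B-) + P(B) * K(x) / K(B) for x inside the bin B = (lower, upper].
     * kernelCdf is the cdf of the within-bin kernel (hypothetical parameter).
     */
    static double cdf(double x, double lower, double upper,
                      double pBelow, double pBin, DoubleUnaryOperator kernelCdf) {
        double kLower = kernelCdf.applyAsDouble(lower);
        double kB = kernelCdf.applyAsDouble(upper) - kLower; // mass of B under K
        double kX = kernelCdf.applyAsDouble(x) - kLower;     // kernel mass of (lower, x]
        return pBelow + pBin * kX / kB;
    }

    public static void main(String[] args) {
        // Assumed example: the bin (2, 4] holds 30% of the data and 50% lies below it.
        // Logistic cdf centred at 3, used here only as a stand-in for a Gaussian kernel.
        DoubleUnaryOperator kernel = t -> 1.0 / (1.0 + Math.exp(-(t - 3.0)));
        System.out.println(cdf(2.0, 2.0, 4.0, 0.5, 0.3, kernel)); // lower endpoint
        System.out.println(cdf(3.0, 2.0, 4.0, 0.5, 0.3, kernel)); // interior point
        System.out.println(cdf(4.0, 2.0, 4.0, 0.5, 0.3, kernel)); // upper endpoint
    }
}
```

At the lower endpoint the sketch returns P(B-), and at the upper endpoint it returns P(B-) + P(B), so adjacent bins join continuously.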
The binCount is set by default to 1000. A good rule of thumb is to set the bin count to approximately the length of the input file divided by 10.

Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_BIN_COUNT: Default bin count. |
protected RandomDataGenerator | randomData: RandomDataGenerator instance to use in repeated calls to getNextValue(). |

Fields inherited from class AbstractRealDistribution: DEFAULT_SOLVER_ABSOLUTE_ACCURACY

Constructor and Description |
---|
EmpiricalDistribution(): Creates a new EmpiricalDistribution with the default bin count. |
EmpiricalDistribution(int binCount): Creates a new EmpiricalDistribution with the specified bin count. |
EmpiricalDistribution(int binCount, RandomGenerator generator): Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data. |
EmpiricalDistribution(RandomGenerator generator): Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data. |

Modifier and Type | Method and Description |
---|---|
double | cumulativeProbability(double x) |
double | density(double x) |
int | getBinCount(): Returns the number of bins. |
List<StreamingStatistics> | getBinStats(): Returns a List of StreamingStatistics instances containing statistics describing the values in each of the bins. |
double[] | getGeneratorUpperBounds(): Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. |
protected RealDistribution | getKernel(StreamingStatistics bStats): The within-bin smoothing kernel. |
double | getNextValue(): Generates a random value from this distribution. |
double | getNumericalMean() |
double | getNumericalVariance() |
StatisticalSummary | getSampleStats(): Returns a StatisticalSummary describing this distribution. |
double | getSupportLowerBound() |
double | getSupportUpperBound() |
double[] | getUpperBounds(): Returns a fresh copy of the array of upper bounds for the bins. |
double | inverseCumulativeProbability(double p) |
boolean | isLoaded(): Property indicating whether or not the distribution has been loaded. |
boolean | isSupportConnected() |
void | load(double[] in): Computes the empirical distribution from the provided array of numbers. |
void | load(File file): Computes the empirical distribution from the input file. |
void | load(URL url): Computes the empirical distribution using data read from a URL. |
void | reSeed(long seed): Reseeds the random number generator used by getNextValue(). |
void | reseedRandomGenerator(long seed): Reseed the underlying PRNG. |

Methods inherited from class AbstractRealDistribution: getSolverAbsoluteAccuracy, logDensity, probability

public static final int DEFAULT_BIN_COUNT
Default bin count.

protected final RandomDataGenerator randomData
RandomDataGenerator instance to use in repeated calls to getNextValue().

public EmpiricalDistribution()
Creates a new EmpiricalDistribution with the default bin count.

public EmpiricalDistribution(int binCount)
Creates a new EmpiricalDistribution with the specified bin count.
Parameters:
binCount - number of bins. Must be strictly positive.
Throws:
MathIllegalArgumentException - if binCount <= 0.

public EmpiricalDistribution(int binCount, RandomGenerator generator)
Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data.
Parameters:
binCount - number of bins. Must be strictly positive.
generator - random data generator (may be null, resulting in default JDK generator)
Throws:
MathIllegalArgumentException - if binCount <= 0.

public EmpiricalDistribution(RandomGenerator generator)
Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data.
Parameters:
generator - random data generator (may be null, resulting in default JDK generator)

public void load(double[] in) throws NullArgumentException
Computes the empirical distribution from the provided array of numbers.
Parameters:
in - the input data array
Throws:
NullArgumentException - if in is null

public void load(URL url) throws IOException, MathIllegalArgumentException, NullArgumentException
Computes the empirical distribution using data read from a URL.
The input file must be an ASCII text file containing one valid numeric entry per line.
Parameters:
url - url of the input file
Throws:
IOException - if an IO error occurs
NullArgumentException - if url is null
MathIllegalArgumentException - if URL contains no data

public void load(File file) throws IOException, NullArgumentException
Computes the empirical distribution from the input file.
The input file must be an ASCII text file containing one valid numeric entry per line.
Parameters:
file - the input file
Throws:
IOException - if an IO error occurs
NullArgumentException - if file is null

public double getNextValue() throws MathIllegalStateException
Generates a random value from this distribution.
Throws:
MathIllegalStateException - if the distribution has not been loaded

public StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution.
Preconditions: the distribution must be loaded before invoking this method.
Throws:
IllegalStateException - if the distribution has not been loaded

public int getBinCount()
Returns the number of bins.

public List<StreamingStatistics> getBinStats()
Returns a List of StreamingStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.

public double[] getUpperBounds()
Returns a fresh copy of the array of upper bounds for the bins.
Bins are:
[min,upperBounds[0]],(upperBounds[0],upperBounds[1]],...,
(upperBounds[binCount-2], upperBounds[binCount-1] = max].
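The bin layout above implies a simple lookup: a value belongs to the first bin whose upper bound is greater than or equal to it. The following is a minimal sketch of that lookup, assuming a plain array of upper bounds such as the one returned by getUpperBounds(); the class BinLookupSketch and the method findBin are hypothetical names and not part of the documented API, and clamping values above max to the last bin is a choice made only for this sketch.

```java
class BinLookupSketch {

    /** Returns the index of the first bin whose upper bound is >= x. */
    static int findBin(double[] upperBounds, double x) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (x <= upperBounds[i]) {
                return i; // bins are closed on the right
            }
        }
        // Assumption of this sketch: values above max fall into the last bin.
        return upperBounds.length - 1;
    }

    public static void main(String[] args) {
        // Assumed example with binCount = 3 and max = 6.
        double[] upperBounds = {2.0, 4.0, 6.0};
        System.out.println(findBin(upperBounds, 2.0)); // prints 0
        System.out.println(findBin(upperBounds, 2.5)); // prints 1
        System.out.println(findBin(upperBounds, 6.0)); // prints 2
    }
}
```

Because the first bin is closed on both sides, min itself maps to bin 0.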
public double[] getGeneratorUpperBounds()
Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.
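As an illustration of subintervals whose lengths are proportional to bin counts, the sketch below derives such bounds as cumulative relative frequencies and then selects a bin for a uniform value in [0,1]. It is a hedged example only: GeneratorBoundsSketch, generatorUpperBounds and selectBin are hypothetical names, the bin counts are made-up inputs, and the sketch assumes at least one observation in total.

```java
class GeneratorBoundsSketch {

    /** bounds[i] = (count[0] + ... + count[i]) / total, so the last bound is 1. */
    static double[] generatorUpperBounds(long[] binCounts) {
        long total = 0;
        for (long c : binCounts) {
            total += c;
        }
        double[] bounds = new double[binCounts.length];
        long running = 0;
        for (int i = 0; i < binCounts.length; i++) {
            running += binCounts[i];
            bounds[i] = (double) running / total;
        }
        return bounds;
    }

    /** Returns the index of the first subinterval whose upper bound is >= u. */
    static int selectBin(double[] bounds, double u) {
        for (int i = 0; i < bounds.length; i++) {
            if (u <= bounds[i]) {
                return i;
            }
        }
        return bounds.length - 1;
    }

    public static void main(String[] args) {
        long[] counts = {2, 3, 5}; // assumed bin counts, 10 observations in total
        double[] bounds = generatorUpperBounds(counts);
        System.out.println(java.util.Arrays.toString(bounds));
        System.out.println(selectBin(bounds, 0.35)); // prints 1
    }
}
```

A bin holding half of the observations receives a subinterval of length 0.5, so a uniform draw selects it about half of the time.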
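Preconditions: a load method must have been called beforehand.
Throws:
NullPointerException - unless a load method has been called beforehand.

public boolean isLoaded()
Property indicating whether or not the distribution has been loaded.

public void reSeed(long seed)
Reseeds the random number generator used by getNextValue().
Parameters:
seed - random generator seed

public double density(double x)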
Returns the kernel density normalized so that its integral over each bin equals the bin mass.

public double cumulativeProbability(double x)
Algorithm description: the cdf matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels, as set out in the class description.

public double inverseCumulativeProbability(double p) throws MathIllegalArgumentException
Specified by:
inverseCumulativeProbability in interface RealDistribution
Overrides:
inverseCumulativeProbability in class AbstractRealDistribution
Throws:
MathIllegalArgumentException

public double getNumericalMean()
public double getNumericalVariance()
public double getSupportLowerBound()
public double getSupportUpperBound()
public boolean isSupportConnected()

public void reseedRandomGenerator(long seed)
Reseed the underlying PRNG.
Parameters:
seed - new seed value

protected RealDistribution getKernel(StreamingStatistics bStats)
The within-bin smoothing kernel. Returns a Gaussian distribution parameterized by bStats, unless the bin contains fewer than 2 observations, in which case a constant distribution is returned.
Parameters:
bStats - summary statistics for the bin

Copyright © 2016-2022 CS GROUP. All rights reserved.