
Wednesday, March 21, 2012

May I have my attributes discretized based on my own expression?

Hi, all here.

I have a question about discretizing continuous attribute values. The SQL Server 2005 data mining engine currently offers these three discretization methods:

· automatic
· equal areas
· clusters

So how does each of these three methods work? For example, how does the clusters method discretize continuous values?

More importantly, can we discretize based on our own expression? For example, if I have a column with values ranging from 1 to 10, can we discretize it into ranges such as 1-3, 4-6, 7-10?

Thanks a lot for any guidance.

User-defined ranges are not supported.

Here are descriptions of the supported discretization methods:

· Clusters: This finds buckets by performing single-dimensional clustering on the input values using the K-Means algorithm. It uses Gaussian distributions.

· EqualAreas: This examines the distribution of values across the population and creates bucket ranges such that the total population is distributed equally across the buckets. In other words, if the distribution of continuous values were plotted as a curve, the areas under the curve covered by each bucket range would be equal. This is useful when there is a large number of duplicate values.

· Automatic: If this is selected, we try obtaining the requested number of buckets by applying the above discretization methods in the following order: Clusters, EqualAreas. We use the first method that gets closest to the number of requested buckets.
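To make the EqualAreas idea concrete, here is a rough sketch in Python. It is not the engine's actual code: the function names `equal_area_edges` and `bucket_of` are made up for illustration, and the boundaries are simply taken from the sorted values so that each bucket gets about the same number of rows.

```python
import bisect


def equal_area_edges(values, buckets):
    """Return the lower boundary of every bucket except the first.

    Illustrative only: boundaries are read off the sorted values so that
    each bucket holds roughly len(values) / buckets rows.
    """
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for k in range(1, buckets):
        # Position of the first value that should fall into bucket k.
        cut = k * n // buckets
        edges.append(ordered[cut])
    return edges


def bucket_of(value, edges):
    """Return the 0-based bucket index for a value.

    bisect_right sends a value equal to a boundary into the higher bucket.
    """
    return bisect.bisect_right(edges, value)


# Skewed data: most rows are small, a few are very large.
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 50, 100, 200, 1000]
skewed_edges = equal_area_edges(skewed, 3)
skewed_counts = [0, 0, 0]
for v in skewed:
    skewed_counts[bucket_of(v, skewed_edges)] += 1
```

With this sketch the bucket ranges are very uneven in width, but each bucket ends up with the same share of the rows, which is the point of the method. Heavily duplicated values that straddle a boundary would make the counts less even than in this small example.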

The Clusters method uses random sampling (with a sample size of 1000), so EqualAreas may be used in situations where sampling is not desirable.
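For comparison, a single-dimensional clustering pass over a sample could look roughly like the Python sketch below. Everything here is an assumption for illustration: the name `kmeans_1d_edges`, the starting centers, the iteration limit and the use of midpoints as bucket boundaries are not taken from the product. It is plain k-means and does not model Gaussian distributions.

```python
import random


def kmeans_1d_edges(values, buckets, sample_size=1000, iterations=20, seed=0):
    """Return bucket boundaries found by a simple 1-D k-means (sketch)."""
    rng = random.Random(seed)
    # Work on a random sample when there are more rows than sample_size.
    if len(values) > sample_size:
        sample = rng.sample(values, sample_size)
    else:
        sample = list(values)
    ordered = sorted(sample)
    n = len(ordered)
    # Assumed starting point: centers spread evenly over the sorted sample.
    centers = [ordered[(2 * k + 1) * n // (2 * buckets)] for k in range(buckets)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in ordered:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    centers.sort()
    # A boundary sits halfway between two neighbouring centers.
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


clustered = [1, 2, 3, 20, 21, 22, 100, 101, 102]
cluster_edges = kmeans_1d_edges(clustered, 3)
```

Unlike the equal-population approach, the boundaries here follow the gaps in the data, so buckets can hold very different numbers of rows. Because of the sampling step, two runs over a large table can give slightly different boundaries unless the seed is fixed.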

|||Hi, Thanks a lot.|||

However, you can always add a calculated column to do your own discretization. For example you can add a column "AgeDisc" with the expression

CASE WHEN [Age]<20 THEN 'Under 20'
WHEN [Age] <= 30 THEN 'Between 20 and 30'
ELSE 'Over 30'
END

Of course, you will have to map any input data to these values for predictions.
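As a sketch of that mapping step, the same rules can be applied to incoming rows before they are sent for prediction. This is only an illustration in Python: the function name `age_disc` and the sample rows are made up, and the labels have to match the calculated column exactly.

```python
def age_disc(age):
    """Mirror of the AgeDisc CASE expression for client-side input data."""
    if age is None:
        # In SQL, comparisons against NULL are never true, so the CASE
        # expression falls through to its ELSE branch.
        return 'Over 30'
    if age < 20:
        return 'Under 20'
    if age <= 30:
        return 'Between 20 and 30'
    return 'Over 30'


# Hypothetical input rows; add the discretized value before predicting.
rows = [{'Age': 19}, {'Age': 20}, {'Age': 30}, {'Age': 31}]
for row in rows:
    row['AgeDisc'] = age_disc(row['Age'])
```

The asker's own ranges (1-3, 4-6, 7-10) would work the same way: one branch per range, with identical labels on the training side and the prediction side.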

|||Jamie, thanks a lot. Very helpful.