Womack Report

August 20, 2007

August 20, Statistics Notes

Filed under: Math,Notes,School — Phillip Womack @ 2:52 pm

Statistics.

Professor is Bob Hill.  Was about five minutes late to class, but we’ll try to avoid drawing conclusions yet.

Have the textbook.  It’s sadly a new book; no used copies.  If I take the shrink-wrap off, I can’t return it.  Might try to leave it untouched today and order one over Amazon, but if needed I’ll crack it open.

Big classroom.  Not nearly full, but it’s not a small class, either.  No power outlets in convenient spots.

Class plan is to go straight through chapters 1-9, and then skip around a bit.  Chapter 14 will probably be covered.  Prof. expects to cover material in the order presented in the syllabus, but not necessarily to adhere to the schedule in the syllabus.

Grades will be based on exams.  Homework assignments will be collected, but generally not graded for correctness.  Tests will increase in value as time passes; three tests are expected to be 25%, 35%, and 40%.

Not going to need much calculus.  Finite Math and College Algebra needed.  I have those, so no sweat.

No makeup exams.

Story time.  Hill was finishing his MBA, decided to get a doctorate.  Came to U of H to get it.  Had about 9 students in his PhD group.  Required two foreign languages at that time.  He planned on taking Russian, because it’s common in the statistics field.  Lots of major mathematicians were from Russia.  Second language was Statistics, which is effectively a different language.

Point:  Words may have specific meanings in statistics different from conventional English meanings.  Be careful about this.

Statistics can be broken into two branches.
1.  Descriptive Statistics
2.  Inferential Statistics

The principal thrust of the course is inferential statistics, but the first few days of the course will deal with descriptive statistics.

In descriptive statistics, we have a set of data, and are trying to describe or summarize that data.  The U.S. Census is descriptive statistics.

Inferential Statistics is different.  For inferential statistics, you have a population, with certain characteristics.  You withdraw a sample from that population, study the sample’s characteristics, and use that data to make inferences about the larger population.

A characteristic of a population is called a “parameter”.  Usually, a parameter is quantitative data.  A “statistic” is a characteristic of a sample, in the same manner.  Parameters are frequently represented with Greek symbols, while statistics are represented with English symbols.
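A small sketch of the distinction, assuming a made-up population of the integers 1 through 1000 and a sample size of 50 (both are illustrative choices, not from the lecture).  The population mean is a parameter, conventionally written with the Greek letter mu; the sample mean is a statistic, conventionally written x-bar.

```python
import random
from statistics import mean

# Hypothetical population: the integers 1..1000 (an assumption for illustration).
population = list(range(1, 1001))

# Parameter: a characteristic of the whole population (Greek letter, mu).
mu = mean(population)

# Withdraw a sample without replacement; the seed only makes the run repeatable.
rng = random.Random(0)
sample = rng.sample(population, 50)

# Statistic: the same characteristic, computed from the sample only (x-bar).
x_bar = mean(sample)

print(f"parameter mu    = {mu}")
print(f"statistic x_bar = {x_bar}")
```

The statistic will usually land near the parameter without matching it, which is why inference from a sample takes some care.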

Statistics have to do with the meaning of numbers.  When dealing with numbers, it’s often important to distinguish the “strength of measurement” of those numbers.  Strength of measurement is often divided into four categories.  In order of ascending strength:
1.  Nominal: Classification level of measurement, or accounting level of measurement.  Values are just category labels.
2.  Ordinal: Classifiable by rank.  A > B is meaningful, but differences can’t be compared; A – B is meaningless.
3.  Interval: Classification by length/magnitude.  Differences between members are meaningful, so A – B can be compared.
4.  Ratio: Classification by magnitude with a meaningful zero value, so ratios such as A / B are also meaningful.
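One way to see the gap between the interval and ratio levels is to change units and check whether a ratio survives.  The sketch below uses two standard textbook examples that were not part of the lecture: temperature in Celsius (interval, because zero is arbitrary) and length in meters (ratio, because zero means no length).

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

# Interval data: the zero point is arbitrary, so ratios depend on the unit.
cool_c, warm_c = 10.0, 20.0
cool_f, warm_f = celsius_to_fahrenheit(cool_c), celsius_to_fahrenheit(warm_c)
ratio_c = warm_c / cool_c   # "twice as hot" in Celsius...
ratio_f = warm_f / cool_f   # ...but not in Fahrenheit

# Ratio data: zero is meaningful, so the ratio is the same in any unit.
short_m, long_m = 2.0, 4.0
short_cm, long_cm = short_m * 100, long_m * 100
ratio_m = long_m / short_m
ratio_cm = long_cm / short_cm

print(f"temperature ratio: {ratio_c} (C) vs {ratio_f} (F)")
print(f"length ratio:      {ratio_m} (m) vs {ratio_cm} (cm)")
```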

That’s it for Chapter 1.  Continuing to Chapters 2 & 3, at 4:45.  So far, lots of good information.  This class may well be worth my time.  That’s a refreshing change.

Raw Data vs. Grouped Data

Raw data are the actual values of a data set; that is, the set of numbers which make up the data set.  Raw data are difficult to deal with; any data set of meaningful size is going to be hard to understand at a glance.

Grouped data are the result of defining a set of data classes and counting the elements of the data set that fall into each class.  Grouped data has its own challenges.  It’s important to know how many data classes should be defined, how the classes are actually defined, how each class will be characterized, and how to best display the grouped data.

Each of those questions must be answered for each data set.  When setting the number of classes to define, you do not want too many or too few groups.  There is a commonly used guideline known as Sturges’ Rule.  Sturges’ Rule states that K, the number of classes, should be equal to one plus 3.3 times the base-10 logarithm of M, the number of data points.  Rounding will be required.  This is a rule of thumb, not a law; the proper number of classes will often be in the area of this value, but judgement is required.  This will be on some test, probably on the second, and is not in the book.

K = 1 + 3.3 * log10(M)

The book recommends between 5 and 15 classes, inclusive.
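A quick sketch of the rule in Python.  The function name is made up for illustration, and rounding to the nearest whole number is an assumption; the notes only say that rounding will be required.

```python
import math

def sturges_classes(m):
    """Suggested number of classes K for m data points (Sturges' Rule).

    Rounds to the nearest integer, which is an assumption here.
    """
    return round(1 + 3.3 * math.log10(m))

for m in (50, 100, 1000):
    print(f"M = {m:>4}  ->  K = {sturges_classes(m)}")
```

For these three sizes the rule gives 7, 8, and 11 classes, all inside the book's 5 to 15 range.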

Once the number of desired classes is determined, the classes must still be defined.  A range of values must be determined; the range is normally equal to the difference between the largest value and the smallest value.  It is commonly desired that each class encompasses an equal portion of the range, but exceptions exist.  Therefore, the width of each class is commonly equal to the range, R, divided by the number of classes, K, or width = R/K.  When defining classes, professor prefers to use a mathematical statement, so that classes are unambiguous.
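A sketch of equal-width classes, using ten made-up data values and K = 4 chosen by hand.  The mathematical statement used for each class (lower <= x < upper, with the last class also including its upper bound so the maximum has a home) is one common convention and an assumption here, not necessarily the professor's.

```python
# Hypothetical raw data, for illustration only.
data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 52]
k = 4  # number of classes, chosen by hand for this example

low, high = min(data), max(data)
r = high - low          # range R = largest value - smallest value
width = r / k           # class width = R / K

# Each class is a (lower, upper) pair.
classes = [(low + i * width, low + (i + 1) * width) for i in range(k)]

def class_index(x):
    """Index of the class containing x: lower <= x < upper,
    except that the last class also includes its upper bound."""
    return min(int((x - low) // width), k - 1)

for i, (lower, upper) in enumerate(classes):
    closing = "<=" if i == k - 1 else "<"
    print(f"class {i + 1}: {lower} <= x {closing} {upper}")
```

Stating the classes this way leaves no doubt about where a boundary value such as 22 belongs: it falls in the second class, not the first.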

When classes are characterized, frequently the midpoint or class mark is used as the distinguishing feature.  The midpoint is determined by adding the extreme values of the class, and dividing by two.  Frequently, the raw data for a data set will be unavailable, and only the grouped data will be usable, and referred to by class mark.
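The class mark calculation is short enough to show directly.  The class boundaries below are hypothetical values picked for illustration.

```python
# Hypothetical class boundaries as (lower, upper) pairs.
classes = [(12, 22), (22, 32), (32, 42), (42, 52)]

# Class mark (midpoint) = (lower extreme + upper extreme) / 2
class_marks = [(lower + upper) / 2 for lower, upper in classes]

for (lower, upper), mark in zip(classes, class_marks):
    print(f"{lower}-{upper}: class mark {mark}")
```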

Displaying data has any number of possible variations.  One common method is a frequency distribution, and its many variants.  The frequency distribution is the number of data points which fall into each class.  The relative frequency distribution is the proportion (or percentage) of all data points which fall into a particular class.  The cumulative frequency distribution is the number of data points which fall into a particular class and all smaller classes.  The cumulative relative frequency distribution is the proportion of data points which fall into a particular class and all smaller classes.  The anti-cumulative frequency and the anti-cumulative relative frequencies are like the standard cumulative frequency variants, but accumulate the data in each class and all larger classes, rather than smaller classes.
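All of these variants can be derived from the plain frequency counts.  A sketch, assuming four classes with made-up counts:

```python
from itertools import accumulate

# Hypothetical frequency distribution: data points per class, smallest class first.
frequency = [3, 3, 3, 1]
n = sum(frequency)

# Relative frequency: share of all data points in each class.
relative = [f / n for f in frequency]

# Cumulative frequency: this class plus all smaller classes.
cumulative = list(accumulate(frequency))
cumulative_relative = [c / n for c in cumulative]

# Anti-cumulative frequency: this class plus all larger classes.
anti_cumulative = list(accumulate(reversed(frequency)))[::-1]
anti_cumulative_relative = [c / n for c in anti_cumulative]

print("frequency          ", frequency)
print("relative           ", relative)
print("cumulative         ", cumulative)
print("cumulative relative", cumulative_relative)
print("anti-cumulative    ", anti_cumulative)
```

The last cumulative value and the first anti-cumulative value both equal the total number of data points, which makes a handy sanity check.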

Graphical displays of data are varied.  One common method is a histogram, a kind of bar graph.  Histograms will have a range axis and a value (often frequency) axis.  Each class will have one graph bar.  If the data is plotted with one point at each class mark, rather than a bar, and the points are joined together, one has a frequency polygon graph.  If cumulative relative frequency is graphed against the classes, and the point at the rightmost edge of each class is linked to the next, the result is called an ogive.
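A rough sketch of the three displays without a plotting library, assuming made-up classes and counts.  The histogram is drawn as text bars, while the frequency polygon and ogive are given as the lists of points one would join with lines.

```python
from itertools import accumulate

# Hypothetical classes (lower, upper) and their frequencies.
classes = [(12, 22), (22, 32), (32, 42), (42, 52)]
frequency = [3, 3, 3, 1]
n = sum(frequency)

# Histogram: one bar per class, bar length = frequency.
bars = [f"{lo:>2}-{hi:<2} | " + "#" * f
        for (lo, hi), f in zip(classes, frequency)]
for bar in bars:
    print(bar)

# Frequency polygon: one point at each class mark, joined together.
polygon_points = [((lo + hi) / 2, f)
                  for (lo, hi), f in zip(classes, frequency)]

# Ogive: cumulative relative frequency at the rightmost edge of each class.
ogive_points = [(hi, c / n)
                for (lo, hi), c in zip(classes, accumulate(frequency))]

print("polygon:", polygon_points)
print("ogive:  ", ogive_points)
```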

Read Chapter 2, understand the types of charts.  Stem-and-leaf plots, box-and-whisker plots, etc., are less commonly used, but may still get questions asked about them.  Scatter plots will be important later.  Scatter plots are used for observations involving more than a single variable.
