## Collection And Presentation Of Statistical Data – CS Foundation Statistics Notes

The collection of data is a major step in statistics. Before collecting data, the analyst has to decide the objective for which the data is required, the statistical units to be used, and the degree of accuracy required. Data collection is an important aspect of any type of research study: inaccurate data collection can distort the results of a study and ultimately lead to invalid conclusions. There are 2 types of data:

• Primary data
• Secondary data

1. Primary data
Data collected for the first time, directly from the source, is primary data. It is collected by the investigator himself or herself using methods such as the following:

• Interview
• Observation
• Action research
• Case studies
• Life histories
• Questionnaires, etc.

Some of the major sources of primary data in India are the Central Statistical Organization, the Census of India, the National Sample Survey, and the Reserve Bank of India.

2. Secondary data
Secondary data is any information collected by someone other than its user. It is data that has already been collected and is readily available for use. It is used to gain initial insight into the research problem. It is classified in terms of its source as either internal or external. Secondary data is available from sources such as:

• Already published data, e.g. publications of government departments and agencies
• Previous research
• Official statistics
• Mass media products
• Diaries
• Letters
• Government reports
• Web information
• Historical data and information
• Publications of international bodies like UNO, WTO

3. Precautions before using secondary data
Before using secondary data, the investigator should take the following precautions:
1. Suitable Purpose of Investigation: The investigator must ensure that the data are suitable for the purpose of the inquiry.
2. Adequacy of Data: Adequacy of the data is to be judged in the light of the requirements of the survey as well as the geographical area covered by the available data.
3. Definition of Units: The investigator must ensure that the definitions of units that are used by him are the same as in the earlier investigation.
4. Degree of Accuracy: The investigator should keep in mind the degree of accuracy maintained by each investigator.
5. Time and Condition of Collection of Facts: Before making use of available data, it should be ascertained to which period the data belong and under which conditions they were collected.
6. Comparison: Investigators should keep in mind whether the secondary data is reasonable, consistent, and comparable.
7. Test Checking: The user of the secondary data must do test checking and see that totals and rates have been correctly calculated.
8. Homogeneous Conditions: It is not safe to take published statistics at their face value without knowing their meaning and limitations.
Whether data is collected from primary or secondary sources, the standard of the data depends on the finances available, the accuracy expected, etc. These days secondary data is generally used, as reliable government publications are available.
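The test checking described in precaution 7 can be sketched in a few lines of Python. This is only an illustration: the function name `check_published_table`, the tolerance, and the figures are assumptions made up for the example, not taken from any real publication.

```python
def check_published_table(rows, published_total, tolerance=0.05):
    """Recompute the total and each percentage share; return a list of mismatches.

    rows: list of (label, value, published_percentage) tuples.
    tolerance: allowed gap (in percentage points) before a share is flagged.
    """
    problems = []
    recomputed_total = sum(value for _, value, _ in rows)
    if recomputed_total != published_total:
        problems.append(
            f"total: published {published_total}, recomputed {recomputed_total}"
        )
    for label, value, published_pct in rows:
        recomputed_pct = value / recomputed_total * 100
        if abs(recomputed_pct - published_pct) > tolerance:
            problems.append(
                f"{label}: published {published_pct}%, recomputed {recomputed_pct:.1f}%"
            )
    return problems


# Hypothetical published figures (invented for this sketch).
rows = [("Region A", 120, 30.0), ("Region B", 200, 50.0), ("Region C", 80, 25.0)]
print(check_published_table(rows, published_total=400))
```

Here the share published for Region C does not match the recomputed share, so it is the only item reported.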

4. Primary data collection methods
(a) Personal interview
Advantage – permits detailed and in-depth questions and responses

Disadvantages – interviewer (investigator) bias

• the interviewee may not give true and correct answers

(b) Telephone interview
Advantages – saves time

• relatively inexpensive

Disadvantage – limited length and depth of questions and responses

(d) Information received from local agencies
Advantages – good when information required on a continuous basis

• Cost-effective

Disadvantages – not useful for comprehensive and extensive study

• Agencies from which data is collected could be biased

5. Census and Sample
A census measures every unit of the population, for example absolutely everyone in the whole country. A representative sample measures a small number of people who fit a particular category. For example, out of 5,000 females, only those who are working are surveyed regarding their smoking habits. This example depicts a sample that measures a small number of people who fit a particular category.

6. Pros of census

• provides a true measure of the population (no sampling error)
• benchmark data may be obtained for future studies
• detailed information about small sub-groups within the population is more likely to be available
• can be used for various research purposes
• reliable data

7. Cons of census

• may be difficult to enumerate all units of the population within the available time
• higher costs, both in staff and monetary terms, than for a sample
• generally takes longer to collect, process, and release data than from a sample

8. Pros of sample

• costs would generally be lower than for a census
• results may be available in less time
• more detailed information can be collected from each unit, because a sample is less costly and less time-consuming
• if good sampling techniques are used, the results can be very representative of the actual population
• greater scope of flexibility

9. Cons of sample

• data may not be representative of the total population, particularly where the sample size is small.
• often not suitable for producing benchmark data
• as data are collected from a subset of units and inferences made about the whole population, the data are subject to ‘sampling’ error
• decreased number of units will reduce the detailed information available about sub-groups within a population
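The trade-off between census and sample can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is synthetic (made-up incomes), and `mean_abs_sampling_error` is a hypothetical helper written for this note. It shows the sampling error shrinking as the sample grows and vanishing when every unit is measured, as in a census.

```python
import random
import statistics


def mean_abs_sampling_error(population, sample_size, repeats=100, seed=0):
    """Average absolute gap between a sample mean and the population mean."""
    rng = random.Random(seed)
    census_mean = statistics.fmean(population)
    errors = []
    for _ in range(repeats):
        # Simple random sample drawn without replacement.
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - census_mean))
    return statistics.fmean(errors)


# Synthetic population of 5,000 monthly incomes (illustrative numbers only).
rng = random.Random(42)
population = [rng.gauss(6000, 1500) for _ in range(5000)]

for n in (10, 100, 1000, len(population)):
    print(n, round(mean_abs_sampling_error(population, n), 2))
```

The last line of output corresponds to measuring every unit, so no sampling error remains.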

10. Presentation of data.
Once the data has been assembled, a systematic procedure is to be followed to make the data presentable so that the purpose for which data was collected is achieved. The procedure to be followed is

1. Classification of data
Data is classified to give it a meaningful shape. Data classification highlights the salient features of the data.
Once the data has been collected, it is classified into various groups on the basis of the common factors present in them. For example, data can be classified on the basis of literacy (literate or illiterate), on the basis of working status (working or non-working), or on the basis of income (wages below Rs. 5000 and wages between Rs. 5000 and Rs. 7000). Thus we can classify data on various grounds. There are 4 bases of classification of data:

(a) Chronological – In this type of classification the data is classified on the basis of time. For example, GDP was ‘A’ in the year 2001 and ‘B’ in the year 2002. Related data are arranged in chronological order: data for 2001, then data for 2002, and so on.

(b) Quantitative – Quantitative data is data that can be measured numerically. Things that can be measured precisely rather than through interpretation such as the number of attendees at an event, the temperature in a given location, or a person’s height in inches can be considered quantitative data.

(c) Qualitative – Qualitative data are forms of information gathered in a non-numeric form. In qualitative classification, data are classified on the basis of some attributes or quality such as sex, literacy, religion, etc. In this type of classification, the attribute can not be measured rather it is classified on the basis of whether the attribute is present or absent.

(d) Geographical classification – This type of classification is based on the data available for various regions. Data is classified on the basis of geographical divisions.
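The four bases of classification can be sketched with one small grouping function. The records, the state names, and the extra "above Rs. 7000" class are assumptions added only so the example is complete; the two wage classes come from the example earlier in this section.

```python
from collections import defaultdict


def classify(records, key):
    """Group records into classes; `key` maps a record to its class label."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


def income_class(record):
    """Quantitative classes; the top class is an assumption for completeness."""
    wage = record["wage"]
    if wage < 5000:
        return "below Rs. 5000"
    if wage <= 7000:
        return "Rs. 5000 to Rs. 7000"
    return "above Rs. 7000"


# Hypothetical survey records (invented for illustration).
records = [
    {"year": 2001, "state": "Kerala", "literate": True, "wage": 4500},
    {"year": 2001, "state": "Punjab", "literate": False, "wage": 5200},
    {"year": 2002, "state": "Kerala", "literate": True, "wage": 6900},
    {"year": 2002, "state": "Punjab", "literate": True, "wage": 7400},
]

chronological = classify(records, key=lambda r: r["year"])
geographical = classify(records, key=lambda r: r["state"])
qualitative = classify(records, key=lambda r: "literate" if r["literate"] else "illiterate")
quantitative = classify(records, key=income_class)

for name, groups in [
    ("chronological", chronological),
    ("geographical", geographical),
    ("qualitative", qualitative),
    ("quantitative", quantitative),
]:
    print(name, {label: len(members) for label, members in groups.items()})
```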

2. Tabulation of data
Tabulation is the process of condensing data for convenience in statistical processing, presentation, and interpretation of the information. As per Secrist, ‘Tables are a means of recording in permanent form the analysis that is made through classification and by placing in juxtaposition things that are similar and should be compared’.
Significance of tabulation of data
Tabulation of data is helpful in many ways. Some of the benefits drawn from the tabulation of data are

• Data gets categorized in such a manner that it can be considered homogeneous for further analysis.
• Categorized data is useful when making comparisons
• It discloses trends and patterns in the data
• Tabulated data can be considered sorted data, which helps in easy identification
• Helpful in statistical analysis

Essential Parts of A Table

Different parts into which a table should be divided would depend on the nature of the data and the purpose for which they have been collected. However, in general, a statistical table is divided into 7 parts which are given below:

1. Table number
2. Title of the table
3. Captions
4. Stubs
5. Body
7. Footnote

Classification of tables

A table can be classified as a data table whenever you need to specify a row or column with header information about that row/column. Broadly it can be classified in the following ways.

1. Simple and Complex Tables
Simple Table
A simple table has at most one header row and one header column, where a header specifies the type of information in its row or column. In addition, there are no merged cells within a simple table. Simple tabulation is when the data are tabulated according to one characteristic. For example, the class survey conducted on November 11, 2011, determined the frequency or number of students owning different brands of mobile phones like Blackberry, Nokia, iPhone, etc.

Complex Table
Complex tabulation is the tabulation of data according to two or more characteristics. For example, the frequency or number of girls, boys, and the total class owning the different brands of mobile phones like Blackberry, Nokia, iPhone, etc.

Cross Table
Cross tabulations are also a sub-type of complex tabulation that includes cross-classifying factors to build a contingency table of counts or frequencies at each combination of factor levels.

Contingency Table
A contingency table is a display format used to analyze and record the possible relationship between two or more categorical variables. For example, the class survey conducted on November 11, 2011, determined the frequency or number of students owning different brands of mobile phones across boys and girls of ages 17, 18, and 19 (please refer to slides 9 and 10 in the session 2 presentation). The purpose of this cross-tabulation could be to test the assumption that boys and girls own certain mobile brands because of the age group they belong to.
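A contingency table of counts can be built by counting each combination of factor levels. The sketch below is illustrative only: the six student records are made up, and `cross_tabulate` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def cross_tabulate(records, row_key, col_key):
    """Count records at each (row level, column level) combination."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    # Counter returns 0 for combinations that never occur.
    table = {row: {col: counts[(row, col)] for col in cols} for row in rows}
    return table, rows, cols


# Made-up class survey: gender and mobile phone brand of six students.
students = [
    {"gender": "girl", "brand": "Nokia"},
    {"gender": "girl", "brand": "iPhone"},
    {"gender": "boy", "brand": "Nokia"},
    {"gender": "boy", "brand": "Blackberry"},
    {"gender": "boy", "brand": "Nokia"},
    {"gender": "girl", "brand": "iPhone"},
]

table, rows, cols = cross_tabulate(students, "gender", "brand")
print("gender".ljust(8) + "".join(c.ljust(12) for c in cols) + "Total")
for row in rows:
    cells = [table[row][c] for c in cols]
    print(row.ljust(8) + "".join(str(v).ljust(12) for v in cells) + str(sum(cells)))
```

Adding a third key, such as age, to the counted tuple would extend the same idea to the three-way table described in the example.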

General-purpose table
General-purpose tables are also called reference tables or repository tables, and they provide information for general use and reference. Croxton and Cowden have identified the purpose of such tables in the following words: “Primarily and usually the sole purpose of a reference table is to present data in such a manner that individual items may be found readily by a reader”.

Important rules of tabulation
There are no hard and fast rules for preparing a statistical table. Prof. Bowley has rightly pointed out “In collection and tabulation, common sense is the chief requisite and experience is the chief teacher.” However, the following points should be borne in mind while preparing a table.

1. A good table must contain all the essential parts, such as Table number, Title, Headnote, Caption, Stub, Body, Footnote, and source note.
2. A good table should be simple to understand. It should also be compact, complete, and self-explanatory.
3. A good table should be of the proper size. There should be proper space for rows and columns. One table should not be overloaded with details. Sometimes it is difficult to present entire data in a single table. In that case, data are to be divided into more tables.
4. A good table must be prepared in a clear manner for its purpose so that a scholar can understand the problem without any strain.
5. The rows and columns of a table must be numbered.
6. In all tables, the captions and stubs should be arranged in some systematic manner. The arrangement may be alphabetical or chronological, depending upon the requirement.
7. The unit of measurement should be mentioned in the headnote.
8. The figures should be rounded off to the nearest hundred, or thousand or lakh. It helps in avoiding unnecessary details.
9. In case of non-availability of information, one should write N.A. or indicate it by dash (-).

3. Frequency distribution of data
The frequency (f) of a particular observation is the number of times the observation occurs in the data. The distribution of a variable is the pattern of frequencies of the observation. A frequency distribution is a tool for organizing data. We use it to group data into categories and show the number of observations in each category. Frequency distributions are portrayed as frequency tables, histograms, or polygons.
Guidelines for constructing a frequency distribution.

• Each value should fit into a category. The classes should be collectively exhaustive.
• No value should fit into more than 1 category. The classes should be mutually exclusive, i.e. there should be no overlapping of classes.
• Make the classes of equal size if possible. This makes it easier to compare the frequency in one class to another.
• Avoid open-ended classes if possible such as “75 and over”.
• Try to use between 5 and 20 classes if possible. If you have fewer than 5 classes, you’re not really breaking up the data, and if you use more than 20 classes, this will probably be information overload.
• It is usually convenient to use class sizes of 5 or 10, in other words, to have each class containing 5 or 10 possible values.
• It is usually convenient to make the lower limit of the first category a multiple of the class size.

After the first two rules above, the rest are merely suggestions.
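The first two rules can be checked mechanically. The following is a minimal sketch, assuming classes are given as inclusive (lower, upper) limits; the values and both sets of classes are invented for the illustration.

```python
def check_classes(values, classes):
    """Return (uncovered, overlapping) values for classes given as
    inclusive (lower, upper) limits.

    uncovered   -> values that fit no class (classes are not exhaustive)
    overlapping -> values that fit more than one class (not mutually exclusive)
    """
    uncovered, overlapping = [], []
    for v in values:
        hits = sum(1 for lower, upper in classes if lower <= v <= upper)
        if hits == 0:
            uncovered.append(v)
        elif hits > 1:
            overlapping.append(v)
    return uncovered, overlapping


# Hypothetical observations and two candidate sets of classes.
values = [36, 45, 54, 92]
good = [(35, 44), (45, 54), (55, 64), (65, 74), (75, 84), (85, 94)]
bad = [(35, 45), (45, 55), (55, 65), (65, 75), (75, 85)]  # shared limits, stops at 85

print(check_classes(values, good))  # ([], [])
print(check_classes(values, bad))   # ([92], [45])
```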
Example 1 — Constructing a frequency distribution table
A survey was taken in an area. In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Frequency distribution table.
Table 1. Frequency table for the number of cars registered in each household

| Number of cars (x) | Tally | Frequency (f) |
|---|---|---|
| 0 | IIII | 4 |
| 1 | IIIIII | 6 |
| 2 | IIIII | 5 |
| 3 | III | 3 |
| 4 | II | 2 |

Let us understand the construction of the frequency distribution table
Use the following steps to present this data in a frequency distribution table.

• Divide the results (x) into intervals and then count the number of results in each interval.
• In this case, the intervals would be the number of households with no car (0), one car (1), two cars (2), and so forth.
• Make a table with separate columns for the interval numbers (the number of cars per household), the tallied results, and the frequency of results in each interval.
• Label these columns Number of cars, Tally, and Frequency.
• Read the list of data from left to right and place a tally mark in the appropriate row.
• For example, the first result is a 1, so place a tally mark in the row beside where 1 appears in the interval column (Number of cars). The next result is a 2, so place a tally mark in the row beside the 2, and so on. When you reach your fifth tally mark, draw a tally line through the preceding four marks to make your final frequency calculations easier to read.
• Add up the number of tally marks in each row and record them in the final column entitled Frequency.

By looking at this frequency distribution table quickly, we can see that out of 20 households surveyed, 4 households had no cars, 6 households had 1 car, etc.
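The tallying steps above can be reproduced with a short Python sketch that uses the 20 survey results from Example 1. The helper names `frequency_table` and `tally` are assumptions for this illustration, and a slash stands in for the strike-through drawn across every fifth mark.

```python
from collections import Counter

# The 20 survey results from Example 1.
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]


def frequency_table(observations):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(observations)
    return [(x, counts[x]) for x in sorted(counts)]


def tally(f):
    """Tally marks in groups of five, e.g. 6 -> '||||/ |'."""
    groups = ["||||/"] * (f // 5)
    if f % 5:
        groups.append("|" * (f % 5))
    return " ".join(groups)


print(f"{'Number of cars (x)':<20}{'Tally':<12}{'Frequency (f)'}")
for x, f in frequency_table(cars):
    print(f"{x:<20}{tally(f):<12}{f}")
```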

Example 2 — Constructing a cumulative frequency distribution table
A cumulative frequency distribution table is a more detailed table. It looks almost the same as a frequency distribution table but it has added columns that give the cumulative frequency and the cumulative percentage of the results, as well.
At a recent chess tournament, all 10 of the participants had to fill out a form that gave their names, address, and age. The ages of the participants were recorded as follows:
36, 48, 54, 92, 57, 63, 66, 76, 66, 80
Cumulative frequency table

Table 2. Ages of participants at a chess tournament

| Lower Value | Upper Value | Frequency (f) | Cumulative Frequency | Percentage | Cumulative percentage |
|---|---|---|---|---|---|
| 35 | 44 | 1 | 1 | 10.0 | 10.0 |
| 45 | 54 | 2 | 3 | 20.0 | 30.0 |
| 55 | 64 | 2 | 5 | 20.0 | 50.0 |
| 65 | 74 | 2 | 7 | 20.0 | 70.0 |
| 75 | 84 | 2 | 9 | 20.0 | 90.0 |
| 85 | 94 | 1 | 10 | 10.0 | 100.0 |

Let us understand the construction of this table
Use the following steps to present these data in a cumulative frequency distribution table.

• Divide the results into intervals, and then count the number of results in each interval. In this case, intervals of 10 are appropriate. Since 36 is the lowest age and 92 is the highest age, start the intervals at 35 to 44 and end the intervals with 85 to 94.

• Create a table similar to the frequency distribution table but with three extra columns.

• In the first column or the Lower value column, list the lower value of the result intervals. For example, in the first row, you would put the number 35.

• The next column is the Upper-value column. Place the upper value of the result intervals. For example, you would put the number 44 in the first row.

• The third column is the Frequency column. Record the number of times a result appears between the lower and upper values. In the first row, place the number 1.

• The fourth column is the Cumulative frequency column. Here we add the cumulative frequency of the previous row to the frequency of the current row. Since this is the first row, the cumulative frequency is the same as the frequency. However, in the second row, the frequency for the 35-44 interval (i.e., 1) is added to the frequency for the 45-54 interval (i.e., 2). Thus, the cumulative frequency is 1 + 2 = 3, meaning we have 3 participants in the 35 to 54 age group.

• The next column is the Percentage column. In this column, list the percentage of the frequency. To do this, divide the frequency by the total number of results and multiply by 100. In this case, the frequency of the first row is 1 and the total number of results is 10. The percentage would then be (1 ÷ 10) × 100 = 10.0.

• The final column is Cumulative percentage. In this column, divide the cumulative frequency by the total number of results and then to make a percentage, multiply by 100. Note that the last number in this column should always equal 100.0. In this example, the cumulative frequency of the first row is 1 and the total number of results is 10, therefore the cumulative percentage of the first row is (1 ÷ 10) × 100 = 10.0.
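The column-by-column steps above can be sketched in Python with the ages and class limits from Example 2. The function name `cumulative_table` is an assumption for this illustration; the classes are treated as inclusive at both limits, as in Table 2.

```python
# Ages and class limits from Example 2.
ages = [36, 48, 54, 92, 57, 63, 66, 76, 66, 80]
classes = [(35, 44), (45, 54), (55, 64), (65, 74), (75, 84), (85, 94)]


def cumulative_table(values, classes):
    """Rows of (lower, upper, frequency, cumulative frequency,
    percentage, cumulative percentage)."""
    total = len(values)
    rows, running = [], 0
    for lower, upper in classes:
        f = sum(1 for v in values if lower <= v <= upper)
        running += f  # cumulative frequency = previous cumulative + current f
        rows.append((lower, upper, f, running, 100 * f / total, 100 * running / total))
    return rows


for row in cumulative_table(ages, classes):
    print("{:>5} {:>5} {:>3} {:>3} {:>6.1f} {:>6.1f}".format(*row))
```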

• Class Limits – separate one class in a grouped frequency distribution from another. The limits can actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.

• Class intervals – While arranging a large amount of data (in statistics), they are grouped into different classes to get an idea of the distribution, and the range of such class of data is called the Class Interval.

• Class Midpoint – it is the midpoint of the class interval, i.e. (lower limit + upper limit) / 2 is the class midpoint.

• Frequency of a class interval – is the number of observations that occur in a particular predefined interval. So, for example, if 20 people aged 5 to 9 appear in our study’s data, the frequency for the 5-9 interval is 20.

Classification is of two types according to the class intervals:

• Exclusive Method
• Inclusive Method.

(i) Exclusive Method: In this method, the upper limit of a class becomes the lower limit of the next class. It is called ‘Exclusive’ because an item equal to the upper limit of a class is not put in that class; it is put in the next class, i.e. the upper limits of classes are excluded from them. For example, a person of age 20 years will not be included in the class interval (10 – 20) but in the next class (20 – 30), since the class interval (10 – 20) includes only values from 10 up to, but not including, 20.

(ii) Inclusive Method: In this method, the upper limit of any class interval is kept in the same class interval. In this method, the upper limit of a previous class is less by 1 from the lower limit of the next class interval. In short, this method allows a class interval to include both its lower and upper limits within it.
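The difference between the two methods comes down to whether the comparison with the upper limit is strict. A minimal sketch, with hypothetical function names and class limits chosen to match the age example:

```python
def assign_exclusive(value, limits):
    """Exclusive method: limits like [10, 20, 30] define classes 10-20 and
    20-30, and a value equal to an upper limit goes to the next class."""
    for lower, upper in zip(limits, limits[1:]):
        if lower <= value < upper:
            return (lower, upper)
    return None  # value lies outside all classes


def assign_inclusive(value, classes):
    """Inclusive method: classes like [(10, 19), (20, 29)] contain both
    their lower and upper limits."""
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    return None


print(assign_exclusive(20, [10, 20, 30]))          # (20, 30)
print(assign_inclusive(19, [(10, 19), (20, 29)]))  # (10, 19)
```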

• The endpoints of a class interval – are the lowest and highest values that a variable can take within the interval. For example, suppose the intervals are 0 to 4 years, 5 to 9 years, 10 to 14 years, 15 to 19 years, 20 to 24 years, and 25 years and over. The endpoints of the first interval are 0 and 4 if the variable is discrete, and 0 and 4.999 if the variable is continuous. The endpoints of the other class intervals would be determined in the same way.

• Class interval width – is the difference between the lower endpoint of an interval and the lower endpoint of the next interval. For example, if continuous intervals are 0 to 4, 5 to 9, etc., the width of the first five intervals is 5, and the last interval is open since no higher endpoint is assigned to it. The intervals could also be written as 0 to less than 5, 5 to less than 10, 10 to less than 15, 15 to less than 20, 20 to less than 25, and 25 and over.

• Class boundaries – Separate one class in a grouped frequency distribution from another. The boundaries have one more decimal place than the raw data and therefore do not appear in the data. There is no gap between the upper boundary of one class and the lower boundary of the next class. The lower class boundary is found by subtracting 0.5 units from the lower class limit and the upper-class boundary is found by adding 0.5 units to the upper-class limit.
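The midpoint, boundary, and width definitions above translate directly into arithmetic. The sketch below applies them to the first classes of Table 2; the function names and the `unit` parameter are assumptions for illustration (with whole-number data, `unit=1` gives the 0.5 adjustment described above).

```python
def class_midpoint(lower, upper):
    """(lower limit + upper limit) / 2"""
    return (lower + upper) / 2


def class_boundaries(lower, upper, unit=1):
    """Subtract half a unit from the lower limit and add half a unit to the
    upper limit; unit=1 corresponds to the 0.5 adjustment for whole numbers."""
    half = unit / 2
    return lower - half, upper + half


def class_width(lower, next_lower):
    """Difference between the lower endpoints of consecutive intervals."""
    return next_lower - lower


limits = [(35, 44), (45, 54), (55, 64)]
for (lower, upper), (next_lower, _) in zip(limits, limits[1:]):
    print(
        (lower, upper),
        class_midpoint(lower, upper),
        class_boundaries(lower, upper),
        class_width(lower, next_lower),
    )
```

Note that the upper boundary of one class (44.5) equals the lower boundary of the next, so no gap remains between classes.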

4. Diagrammatic presentations
Although tabulation is a very good technique to present data, diagrams are a more advanced technique to represent it. A layman cannot easily understand tabulated data, but with a single glance at a diagram one gets a complete picture of the data presented. According to M.J. Moroney, “diagrams register a meaningful impression almost before we think.”

The following are a few advantages of the diagrammatic presentation of data.

• Simple and Easy to understand: The data presented in the form of diagrams is the simplest and the easiest to understand. The entire data can be easily understood even by having a single glance at the diagram.
• Attractive and Impressive: Diagrammatic presentation makes the data more attractive and interesting. Diagrams tend to leave a lasting impact on the mind.
• Helpful in Making Comparisons: The presentation of data in the form of diagrams helps in making comparisons between two or more groups or two or more periods.

Limitations of Diagrammatic Presentation

1. Diagrams do not present the small differences properly.
2. These can easily be misused.
3. Only artists can draw multi-dimensional diagrams.
4. In statistical analysis, diagrams are of no use.
5. Diagrams are just a supplement to tabulation.
6. Only a limited set of data can be presented in the form of a diagram.
7. Diagrammatic presentation of data is a more time-consuming process.
8. Diagrams present preliminary conclusions.
9. Diagrammatic presentation of data shows only an estimate of the actual behavior of the variables.

Guidelines for Diagrammatic presentation

• The diagram should be properly drawn at the outset.
• The pith and substance of the subject matter must be made clear under a broad heading that properly conveys the purpose of a diagram.
• The size of the scale should neither be too big nor too small.
• If it is too big, it may look ugly. If it is too small, it may not convey the meaning.
• In each diagram, the size of the paper must be taken note of. It will help to determine the size of the diagram.
• For clarifying certain ambiguities some notes should be added at the foot of the diagram. This shall provide the visual insight of the diagram.
• Diagrams should be absolutely neat and clean. There should be no vagueness or overwriting on the diagram.
• Simplicity means ‘love at first sight’: the diagram should convey its meaning clearly and easily.
• The scale must be presented along with the diagram.

Types of Diagrams

A. Line Diagrams
In these diagrams, only a line is drawn to represent one variable. These lines may be vertical or horizontal. The line graphs are usually drawn to represent the time series data related to the temperature, rainfall, population growth, birth rates, and death rates.

B. Simple Bar Diagram
It is also called a columnar diagram. Like line diagrams, these figures are also used where only a single dimension, i.e. length, can present the data. The procedure is almost the same; the only difference is that the lines are given thickness, so that they become bars.

C. Multiple Bar Diagrams
This diagram is used when we have to make a comparison between two or more variables. The number of variables may be 2, 3, 4, or more. In the case of 2 variables, a pair of bars is drawn. Similarly, in the case of 3 variables, we draw triple bars.

D. Sub-divided Bar Diagram
When different components are grouped in one set of variables or different variables of one component are put together, their representation is made by a sub-divided bar diagram. In this method, different variables are shown in a single bar with different rectangles.

E. Pie chart
A pie diagram is a circle, with a radius neither too large nor too small, whose area is divided into as many sectors as there are components of the whole data. This is done by drawing straight lines from the center to the circumference of the circle. The area of the circle represents the whole data and is equivalent to 360 degrees at the center. The area of each sector is proportional to the value of the corresponding component of the data, and the area of a sector is in turn proportional to its angle at the center. A pie diagram is very useful in drawing comparisons among the various components or between a part and the whole.

Construction

• Select a suitable radius for the circle to be drawn. A radius of 3, 4, or 5 cm may be chosen for the given data set.
• Draw a line from the centre of the circle to the arc as a radius.
• Measure the angle for each category from this radius, moving clockwise in ascending order, starting with the smallest angle.
• Complete the diagram by adding the title, sub-title, and legend. A legend mark must be chosen for each variable/category and highlighted by distinct shades/colors.
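Since the whole data corresponds to 360 degrees, the angle of each sector is (component ÷ total) × 360. The sketch below computes the angles and lists them in ascending order, as the construction steps suggest; the expense figures and the name `sector_angles` are made up for illustration.

```python
def sector_angles(components):
    """Angle at the center for each component: value / total * 360."""
    total = sum(components.values())
    return {name: value / total * 360 for name, value in components.items()}


# Hypothetical monthly expenses (illustrative figures only).
expenses = {"Food": 40, "Rent": 30, "Transport": 20, "Other": 10}
angles = sector_angles(expenses)

# List sectors in ascending order of angle, smallest first.
for name, angle in sorted(angles.items(), key=lambda item: item[1]):
    print(f"{name}: {angle:.0f} degrees")
```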

5. Graphic Presentation
A graph is a visual representation of data by a continuous curve on a squared (graph) paper. Like diagrams, graphs are also attractive, and eye-catching, giving a bird’s eye view of data and revealing their inner pattern. Graphic presentation enjoys numerous forms of expression ranging from the written word to the most abstract of drawings or statistical graphs.

Graphs are actually two perpendicular lines that intersect each other at a point that is called the origin. The horizontal line is called X-axis and the vertical line is called Y-axis. The four parts of the plane are called quadrants. It may be noted that X and Y are positive in the first quadrant, X is negative and Y is positive in the second quadrant, X and Y both are negative in the third quadrant and X is positive and Y is negative in the fourth quadrant. Graphs are commonly used in the presentation of time series and frequency distribution.

Rules to be followed while making graphs are

• It should have a suitable title.
• A suitable unit of measurement should be used. A suitable scale should be chosen to present the data, so that the graph is reasonably small and can be taken in at a single glance.
• Various sources of data are to be mentioned at the bottom.

Advantages of Graphic Presentation

• It is an efficient method of showing large numbers of observations in a simple manner
• A visual impression is more permanent than sets of figures or words.
• Complex relationships can be demonstrated easily and quickly so that the whole situation is presented simultaneously.
• By the use of color and other devices, one can emphasize certain places. For example, an alarming increase in pollution rate might be pictured in red to bring out the aspect of danger involved.
• Technical qualification is not required to understand the details presented in graphs as they can be easily understood.
• Time-saving for the analyst as data is more understandable
• Helps to locate mean, median, mode
• Helps in forecasting, extrapolation, and interpolation of data.

Limitations of Graphic Presentation

• A graph can be used only to show large or crude variations in the data.
• Lack of flexibility in the event a new combination of the data seems appropriate. This follows as a result of the first disadvantage and is one of the reasons why it may be advisable to present the original data in a table or text accompanying the graph.
• Shows only a few characteristics of data
• Distortion of the situation may result from the desire to oversimplify the material.
• Cannot be quoted in support of a statement
• The construction of a graph or chart may be difficult or costly. This should only apply, however, to large-scale drawings which are employed as posters to educate the lay public or similar groups.

13. Types of Graphs
Basically, they can be divided into two categories

• Graphs of time series
• Graphs of frequency distribution

1. Graphs of time series

Time-series graphs are also known as historigrams (not to be confused with histograms). In this case, the variable value depends on time, measured in units such as hours, minutes, or seconds. Such graphs are mostly used by economists, businessmen, and statisticians. Time is represented on the X-axis and the variable value on the Y-axis. The Y-axis starts at zero. Various types of graphs are:

• Line graph – This graph makes possible the presentation of data with a high degree of accuracy. In fact, careful work and the use of proper coordinate paper make possible the exact reproduction of numerical data, a quality not shared by all forms of graphic presentation. Since different types of lines may be used in tracing the data, two or more series may be presented on the same graph. Time is represented on the X-axis and the variable value on the Y-axis. Sometimes the values are large but the variation between them is small. In such cases the plotting of the graph does not begin from zero; instead, a zigzag horizontal line (a false base line) is drawn just above zero for convenience.

• Net balance graph – We use a net balance graph when we have to show data related to income and expenditure or import and exports etc. Here we draw two lines individually like one line showing data related to imports and the other showing data related to export on the same graph. The difference between the two lines (shaded portion) shows the scenario.
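The quantity that the shaded portion of a net balance graph represents is simply the difference between the two series in each period. A minimal sketch, assuming made-up export and import figures and a hypothetical helper `net_balance`:

```python
def net_balance(exports, imports):
    """Return (period, exports - imports) pairs in time order."""
    return [(year, exports[year] - imports[year]) for year in sorted(exports)]


# Hypothetical trade figures (invented for this sketch).
exports = {2001: 50, 2002: 60, 2003: 55}
imports = {2001: 45, 2002: 65, 2003: 55}

for year, balance in net_balance(exports, imports):
    status = "surplus" if balance > 0 else "deficit" if balance < 0 else "balanced"
    print(year, balance, status)
```

On the graph, a positive difference would be shaded where the export line lies above the import line, and a negative difference where it lies below.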

Constructing a Time Series Graph
To construct a time-series graph, we must look at both pieces of our paired data set. We start with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, each point on the graph corresponds to a date and a measured value. The points on the graph are typically connected by lines in the order in which they occur.

2. Graphs of frequency distribution

• Histogram – Histograms and bar charts are both visual displays of frequencies using columns plotted on a graph. The Y-axis (vertical axis) generally represents the frequency count, while the X-axis (horizontal axis) generally represents the variable being measured. A histogram is a type of graph in which each column represents a range of values of a numeric variable, in particular one that is continuous and/or grouped. A histogram shows the distribution of all observations in a quantitative dataset. It is useful for describing the shape, centre, and spread to better understand the distribution of the dataset. It is defined as a pictorial representation of a grouped frequency distribution by means of adjacent rectangles, whose areas are proportional to the frequencies. Histograms are generally used when the data set has more than 100 observations.

Features of a histogram:

• The height of the column shows the frequency for a specific range of values.
• Columns are usually of equal width; however, a histogram may show data using unequal ranges (intervals) and therefore have columns of unequal width.
• The values represented by each column must be mutually exclusive and exhaustive. Therefore, there are no spaces between columns and each observation can only ever belong in one column.
• It is important that there is no ambiguity in the labeling of the intervals on the x-axis for continuous or grouped data.
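
As a minimal sketch of these features, the code below counts values into adjacent classes that include their lower boundary and exclude their upper boundary, so every observation belongs to exactly one column. The function name, the class settings, and the marks are hypothetical.

```python
def grouped_frequencies(values, lower, width, n_classes):
    """Count values in n_classes adjacent classes of the form [start, start + width).

    Each class includes its lower boundary and excludes its upper boundary,
    so the classes are mutually exclusive. The caller must choose lower, width
    and n_classes so that the classes cover every value (exhaustive).
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the chosen classes")
        counts[index] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical marks, for illustration only.
marks = [12, 25, 37, 41, 18, 29, 33, 45, 22, 30]
table = grouped_frequencies(marks, lower=10, width=10, n_classes=4)

# A crude text histogram: one '#' per observation in the class.
for start, end, freq in table:
    print(f"{start}-{end}: {'#' * freq}")
```

A mark of exactly 30 is counted in the 30-40 class and not in 20-30, which is what removes ambiguity in the labelling of the intervals.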

• Frequency polygon – Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as histograms but are especially helpful for comparing sets of data. To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-axis representing the values of the scores in your data. Mark the middle of each class interval with a tick mark, and label it with the middle value represented by the class. Draw the Y-axis to indicate the frequency of each class. Place a point in the middle of each class interval at the height corresponding to its frequency. Finally, connect the points. You should include one class interval below the lowest value in your data and one above the highest value. The graph will then touch the X-axis on both sides. When several distributions are to be compared on the same graph paper, frequency polygons are better than Histograms.

• Frequency Curve is obtained by joining the points of frequency polygon by a freehand smoothed curve. It is used to remove the ruggedness of polygon and to present it in a good form or shape. This curve is used for a frequency distribution of a continuous distribution when the number of data points becomes very large. In order to plot the points on either the frequency polygon or curve, the mid values of the class intervals of the distribution are calculated. Then the frequencies with respect to the midpoints are plotted. However, in a frequency curve, the points are joined by a smooth curve, whereas in a frequency polygon the points are joined by straight lines. Apart from this major difference, a frequency polygon is a closed figure whereas the frequency curve is not.

• Cumulative frequency curve – Cumulative histograms, also known as ogives, are graphs that can be used to determine how many data values lie above or below a particular value in a data set. The cumulative frequency is calculated from a frequency table, by adding each frequency to the total of the frequencies of all data values before it in the data set. The last value for the cumulative frequency will always be equal to the total number of data values since all frequencies will already have been added to the previous total. The cumulative frequency is plotted on the y-axis against the data which is on the x-axis for un-grouped data. When dealing with grouped data, the Ogive is formed by plotting the cumulative frequency against the upper boundary of the class.

An ogive is used to study the growth rate of data, as it shows the accumulation of frequency. An ogive is drawn by plotting the beginning of the first interval at a Y-value of zero, plotting the end of every interval at the Y-value equal to the cumulative count for that interval, and connecting the points on the plot with straight lines. In this way, the end of the final interval will always be at the total number of data values, since we will have added up across all intervals.
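
The plotting points for a frequency polygon and for a less-than ogive can be worked out as in the sketch below. It assumes equal class widths; the function names and the frequencies are hypothetical.

```python
from itertools import accumulate

def frequency_polygon_points(lower, width, frequencies):
    """Return (class midpoint, frequency) pairs, padded with one empty class on each side."""
    padded = [0] + list(frequencies) + [0]
    first_mid = lower - width / 2  # midpoint of the extra class below the data
    return [(first_mid + i * width, f) for i, f in enumerate(padded)]

def less_than_ogive_points(lower, width, frequencies):
    """Start at (lower, 0), then pair each upper class boundary with its cumulative frequency."""
    cumulative = list(accumulate(frequencies))
    uppers = [lower + (i + 1) * width for i in range(len(frequencies))]
    return [(lower, 0)] + list(zip(uppers, cumulative))

# Hypothetical frequencies for the classes 10-20, 20-30, 30-40, 40-50.
frequencies = [2, 3, 3, 2]
polygon = frequency_polygon_points(10, 10, frequencies)
ogive = less_than_ogive_points(10, 10, frequencies)

print(polygon)  # first and last points have frequency 0, so the polygon touches the X-axis
print(ogive)    # the last cumulative frequency equals the total number of values
```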

14. The Survey Technique
In this section, we use several words that are commonly found in surveying. Let us describe and define their meanings before we start.
• A survey is a technique in which a sample of prospective respondents is selected from a population. The sample is then studied with a view to drawing inferences from their responses to the statements in a questionnaire, or the questions in a series of interviews.

• Population is the term we use to describe the main group of people from which a sample is drawn. A population, therefore, may be an organization’s workforce, a management group, or a group of customers.

• A sample is a representative cross-section of people drawn from a population so that their responses may be studied. The sizes of the samples and the structures of the surveys are determined by the kind of data that needs to be collected and from whom.

Collection And Presentation Of Statistical Data MCQ Questions

Question 1.
Some of the sources of published data are
a. ILO
b. IBRD
c. WFCO
d. All of the above
d. All of the above

Question 2.
When data is classified on the basis of attribute it is termed as
a. Geographical
b. Qualitative
c. Chronological
d. Both a & c
b. Qualitative

Question 3.
In order to save time and money, psychologists collect their data by:
a. Door-to-door survey
b. The use of censuses.
c. Using the earlier research papers.
d. The use of samples
d. The use of samples

Question 4.
A precise point that separates one class from another is
a. Class boundary
b. Class limit
c. Class interval.
d. Nothing to do with statistics.
a. Class boundary

Question 5.
Continuous series of statistical data can be
a. Divided.
b. Can be calculated in fractions
c. Both a & b
d. Cannot be divided
c. Both a & b

Question 6.
Collected data should be
a. Quantitative
b. Subjective
c. Objective
d. None of the above
a. Quantitative

Question 7.
The total expenditure made by industry under different heads is presented by
a. Histogram
b. Line graph.
c. Bar graph.
d. None of the above.
a. Histogram

Question 8.
Whenever we group data into classes it is recommended that we have
a. Less than 5 classes
b. Between 5 and 20 classes
c. At least 2 classes
d. Between 2 and 3 classes
b. Between 5 and 20 classes

Question 9.
A line connecting points that are the cumulative percent of observations below the upper limit of each interval in a cumulative frequency distribution is known as
a. Mode
b. Histogram
c. Frequency polygon
d. Ogive
d. Ogive

Question 10.
Measures of central tendency are
a. Inferential statistics that identify the best single value for representing a set of data
b. Descriptive statistics that identify the best single value for representing a set of data
c. Inferential statistics that identify the spread of the scores in a data set
d. Descriptive statistics that identify the spread of the scores in a data set.
b. Descriptive statistics that identify the best single value for representing a set of data

Question 11.
The mean is
a. The statistical or arithmetic average.
b. The middlemost score.
c. The most frequently occurring score.
d. The best representation for every set of data.
a. The statistical or arithmetic average.

Question 12.
Given the following data set, what is the value of the median? (2, 4, 3, 6, 1, 8, 9, 2, 5, 7)
a. 2
b. 4.7
c. 4.5
d. 10
c. 4.5
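
The answer can be checked with a short sketch. It assumes the data set in Question 12 is the ten values 2, 4, 3, 6, 1, 8, 9, 2, 5, 7.

```python
from statistics import median

# Assumed reading of the data set in Question 12.
data = [2, 4, 3, 6, 1, 8, 9, 2, 5, 7]

ordered = sorted(data)
middle = len(ordered) // 2
# With an even number of values, the median is the mean of the two middle values.
manual_median = (ordered[middle - 1] + ordered[middle]) / 2

print(ordered)
print(manual_median, median(data))
```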

Question 13.
Which of the following is not a characteristic of the mean?
a. It is affected by extreme scores.
b. It minimizes the sum of squared deviations.
c. The sum of the deviations about the mean is 0
d. It is best used with ordinal data
d. It is best used with ordinal data

Question 14.
For making any data useful it is required that it should be
a. Classified
b. Presented properly
c. Appropriate collection
d. All of the above
d. All of the above

Question 15.
A source of semi-government publications is
a. RBI
b. WHO
c. WTO
d. None of the above
a. RBI

Question 16.
Data collected by research institutions is
a. Primary data
b. Secondary unpublished data
c. Secondary published data
d. All of the above
b. Secondary unpublished data

Question 17.
Secondary data should not be
a. Accurate
b. Reliable
c. Suitable
d. None of the above
d. None of the above

Question 18.
Given the following set of data, what is the range? (12, 23, 34, 54, 21, 8, 9, 67)
a. 55
b. 59
c. 8
d. 56
b. 59
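
As a quick check, the range is the largest value minus the smallest value:

```python
# Data set from Question 18.
data = [12, 23, 34, 54, 21, 8, 9, 67]

data_range = max(data) - min(data)  # 67 - 8
print(data_range)
```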

Question 19.
Which of the following statements is true for frequency distribution?
a. The smaller the sample size, the closer the sample mean will be to the population mean.
b. The smaller the population size, the smaller the relationship will be between the sample mean and the population mean.
c. The larger the population size, the closer the population mean will be to the sample mean.
d. The larger the sample size, the closer the sample mean will be to the population mean.
d. The larger the sample size, the closer the sample mean will be to the population mean.

Question 20.
Classification of data is compulsory as
a. It tells the features of data at a glance
b. It helps in meaningful comparison of data
c. Arranges the huge data
d. All of the above
d. All of the above

Question 21.
When data is classified on the basis of time it is termed as
a. Quantitative
b. Chronological
c. Qualitative
d. All of the above
b. Chronological

Question 22.
Trend and Pattern of data can only be understood if data is in
a. Descriptive manner
b. Tabular form
c. Both a & b
d. None of the above
b. Tabular form

Question 23.
A score distribution is given for an inventory measuring physical fitness. The type of graph that will be used to display the information will be
a. Histogram
b. Line graph
c. Pie chart
d. Bar graph
a. Histogram

Question 24.
Whenever we use mean, as a measure of central tendency, the precaution to be taken is
a. Skewed up data is there
b. Random data is there
c. Unorganized data is there
d. None of the above
d. None of the above

Question 25.
A population is
a. Same as a sample
b. The selection of a random sample
c. The collection of all items of interest in a particular study
d. None of the above
c. The collection of all items of interest in a particular study

Question 26.
The entities on which data are collected are
a. Variables
b. Data sets
c. Elements
d. None of the above
c. Elements

Question 27.
Labels or names used to identify attributes of elements are
a. Quantitative data
b. Qualitative data
c. Simple data
d. None of the above
b. Qualitative data

Question 28.
A characteristic of interest for the elements is
a. A variable
b. An element
c. A data set
d. None of the above
a. A variable

Question 29.
Tabulation of data makes the presentation of facts and figures
a. More simplified
b. More understandable
c. Both a & b
d. It is lengthy and occupies space
c. Both a & b

Question 30.
Stubs are
a. Heading for the vertical column
b. Heading of the horizontal column
d. None of the above
b. Heading of the horizontal column

Question 31.
Some of the major sources of primary data in India are
a. CSO
b. Census of India
c. Central Bank of India
d. Both a & b
d. Both a & b

Question 32.
CSO is
a. Census of India
b. Central Statistical organization
c. Central Survey of India
d. Central statistics of India
b. Central Statistical organization

Question 33.
For checking the reliability of data it should be seen that
a. The degree of accuracy required
b. The procedure of collection of data by the primary source
d. Both a & b
d. Both a & b

Question 34.
It is Not an Uncommon way of collecting primary data
a. Indirect personal interview
b. Telephone survey
c. Mailed questions
d. All of the above
d. All of the above

Question 35.
Footnotes are used
a. to point at any specific data which was not expressed in the heading
b. table numbers
c. source note
d. a note at the foot of the document
a. to point at any specific data which was not expressed in the heading

Question 36.
In frequency magnitude,
a. zero should be used to indicate the information that is not available
b. the abbreviation should be avoided
c. magnitude of values and the number of times the value has been repeated
c. magnitude of values and the number of times the value has been repeated

Question 37.
For checking how many students scored above 80 % marks in a class what should be the source of data
a. Secondary data
b. Mailed questions
c. Direct investigation
d. Sample investigation
c. Direct investigation

Question 38.
Indirect oral investigation method of collection of data is to be used when
a. The source is reluctant to give the information
b. When there is less time
c. When the finances are less
d. All of the above
a. The source is reluctant to give the information

Question 39.
For getting periodical information, the best way to seek data is to collect it
a. Indirectly by oral investigation
b. Mailed questionnaire
c. Information received from local agencies
d. All of the above
c. Information received from local agencies

Question 40.
Which amongst these is the most expensive way of collecting data?
a. Questionnaire through enumerator
b. Telephonic survey
c. Direct personal interview
d. Indirect oral investigation
a. Questionnaire through the enumerator

Question 41.
The types of error in the sample investigation method of data collection are
a. Actual error
b. Random error
c. Sampling error
d. Both b & c
d. Both b & c

Question 42.
Arithmetic operations are appropriate for
a. Qualitative data
b. Quantitative data
c. Both quantitative and qualitative data
d. Neither quantitative nor qualitative data
b. Quantitative data

Question 43.
Zipcodes are an example of
a. Qualitative data
b. Quantitative data
c. Neither quantitative nor qualitative data
d. None of the choices are correct
a. Qualitative data

Question 44.
A tabular summary of a set of data, which shows the appearance of data elements in several nonoverlapping classes, is termed
a. The class width
b. a frequency polygon
c. a frequency distribution
d. a histogram
c. a frequency distribution

Question 45.
A tabular summary of a set of data showing classes of the data and the fraction of the items belonging to each class is called.
a. the class width
b. a relative frequency distribution
c. a cumulative relative frequency distribution.
d. an ogive
b. a relative frequency distribution
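
A minimal sketch of a relative frequency distribution follows. The grades are hypothetical and the function name is only illustrative. It also shows that the frequencies add up to the number of elements, while the relative frequencies add up to 1.

```python
from collections import Counter

def relative_frequencies(items):
    """Return the fraction of items that falls in each class."""
    counts = Counter(items)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical grades, for illustration only.
grades = ["A", "B", "A", "C", "B", "A", "B", "A"]
counts = Counter(grades)
shares = relative_frequencies(grades)

print(dict(counts))  # frequency distribution
print(shares)        # relative frequency distribution
```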

Question 46.
To convert it into degrees, each percentage of data in a pie chart should be multiplied by
a. 3.6°
b. 3.7°
c. 3.5°
d. No need of any multiplication
a. 3.6°

Question 47.
Which of the following, if missed, makes a graphic presentation incomplete?
a. Unit of measurement
b. Suitable scale
c. Suitable title
d. All of the above
d. All of the above

Question 48.
A graph is a useful tool as
a. Only limited information can be achieved
b. Information provided is not useful for an expert
c. Not precisely correct
d. It does not require that the person who is using graphs should know mathematics
d. It does not require that the person who is using graphs should know mathematics

Question 49.
International tourists are randomly selected on a beach in Singapore and asked how many days they spend in Singapore.
a. This method will give reliable results if many tourists are asked.
b. This method will overestimate the time tourists stay in Singapore
c. This method will underestimate the time tourists stay in Singapore
d. This method will work only in sunny weather.
b. This method will overestimate the time tourists stay in Singapore

Question 50.
A graphical method of presenting qualitative data by frequency distribution is termed.
a. A frequency polygon
b. An ogive
c. A bar graph
d. None of the above
c. A bar graph

Question 51.
The sum of frequencies for all classes will always equal
a. 1
b. the number of elements in a data set
c. the number of classes
d. a value between 0 to 1
b. the number of elements in a data set

Question 52.
The information invariably required to be put in a good statistical table is/are
a. Descriptive thinking
b. Table Number
d. Both b & c
d. Both b & c

Question 53.
If the upper limit of one class coincides with the lower limit of another class. It is
a. Exclusive class intervals
b. Inclusive class intervals
c. Interval
d. nominal
a. Exclusive class intervals

Question 54.
A numerical description of the outcome of an experiment is a random
a. description
b. outcome
c. number
d. variable
d. variable

Question 55.
What is distinctive about quantitative content analysis in comparison to quantitative data analysis?
a. They enable easy calculation for those of us who are not too good with figures.
b. It allows for the constant re-assessment of categories and themes
c. It is easy for researchers to use
d. None of the above
b. It allows for the constant reassessment of categories and themes

Question 56.
A pie chart is:
a. A chart demonstrating the increasing incidence of obesity in society.
b. Only used in catering management research
c. Any form of pictorial representation of data
d. All illustrations where the data are divided into proportional segments according to the share each has of the total value of the data
d. All illustrations where the data are divided into proportional segments according to the share each has of the total value of the data

Question 57.
The general process of gathering, organizing, summarizing, analyzing, and interpreting data is called
a. Statistics
b. Descriptive statistics
c. Qualitative statistics
d. Measurement of statistics
a. Statistics

Question 58.
Frequency distribution is
a. Tabular arrangement of data with corresponding frequency
b. Graphical arrangement of data with corresponding frequency
c. Tabular arrangement of data without corresponding frequency
d. Graphical arrangement of data without corresponding frequency
a. Tabular arrangement of data with corresponding frequency

Question 59.
Diagrammatic representation of data is
a. Determination of a number of classes
b. Determining the magnitude of classes
c. To show data in geometrical figures
d. None of the above
c. To show data in geometrical figures

Question 60.
A complex table represents
a. Only one factor or variable
b. Always two factors or variables
c. Two or more factors or variables
d. All of the above
c. Two or more factors or variables
Hint
Complex tabulation of data includes more than two characteristics, for example the number of girls, boys, and the total class owning different brands of mobile phones such as BlackBerry, Nokia, iPhone, etc.

Question 61.
One of the following statements of primary data is wrongly stated
a. These constitute first-hand information
b. These are original in nature
c. These are relatively less costly to collect
d. These are more reliable, accurate, and adequate
c. These are relatively less costly to collect
Hint
Primary data
The data which is collected for the first time from the source is Primary data. Primary data is the data collected by the investigator himself.

Question 62.
Among the following sources of collecting primary data, one is not correctly placed-
a. Annual report of the Reserve Bank of India
b. Indirect oral investigation
c. Telephonic survey
d. A questionnaire sent through the enumerator
a. Annual report of the Reserve Bank of India
Hint
Primary data is the data collected by the investigator himself by some of the below-mentioned methods

• Interview
• Observation
• action research
• case studies
• life histories
• questionnaires etc

Question 63.
The basic demerit of sample investigation is that it
a. is less costly
b. is less time consuming
c. is less reliable because it creates many sources or unanticipated errors
d. Possesses the merit of flexibility
c. is less reliable because it creates many sources or unanticipated errors
Hint
Cons of sample

• data may not be representative of the total population, particularly where the sample size is small.
• often not suitable for producing benchmark data
• as data are collected from a subset of units and inferences made about the whole population, the data are subject to ‘sampling’ error hence less reliable.
• decreased number of units will reduce the detailed information available about sub-groups within a population.

Question 64.
The expression $$\frac{\text{Class frequency}}{\text{Width of class}}$$ is known as:
a. Frequency density
b. Mid-value of a class interval.
c. Class interval
d. None of the above
a. Frequency density
Hint
Frequency density = $$\frac{\text{Class frequency}}{\text{Width of class}}$$
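
The formula can be applied as in the sketch below. The classes are hypothetical and have unequal widths, which is the case where frequency density matters: it is the column height that keeps the area of each histogram rectangle proportional to its frequency.

```python
def frequency_density(frequency, lower, upper):
    """Class frequency divided by the width of the class."""
    width = upper - lower
    if width <= 0:
        raise ValueError("upper boundary must be greater than lower boundary")
    return frequency / width

# Hypothetical classes given as (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 30, 8), (30, 40, 6)]
densities = [frequency_density(f, lo, hi) for lo, hi, f in classes]

for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo}-{hi}: frequency {f}, density {d}")
```

The middle class has the highest frequency but the lowest density, because its width is twice that of the others.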

Question 65.
A pie chart is having the shape of-
a. A rectangle
b. A circle
c. A square
d. A bar
b. A circle

Question 66.
The qualitative classification includes-
a. Analysis of time series
b. Analysis of date
c. Analysis of series
d. Analysis of attributes
d. Analysis of attributes
Hint
Qualitative classification – Qualitative data are forms of information gathered in a non-numeric form. In qualitative classification, data are classified on the basis of some attribute or quality such as sex, literacy, religion, etc. In this type of classification, the attribute cannot be measured; rather, data are classified on the basis of whether the attribute is present or absent.

Question 67.
Find the odd one out:
a. Data collected from Internet
b. Data collected from RBI Annual Report
c. Data collected by an investigator
d. Data collected from the IMF Fact-sheet.
c. Data collected by an investigator
Hint
Options a, b, and d are examples of secondary data.

Question 68.
Match the following:

| Diagram | Description |
| --- | --- |
| 1. Simple Bar Diagram | (i) One single bar is drawn |
| 2. Multiple Bar Chart | (ii) More than one bar is used to represent two or more variables |
| 3. Pie Chart | (iii) Circle Diagram |
| 4. Components Bar Chart | (iv) Sub-divided Bar Chart |

The correct option is:
a. 1 (i); 2 (ii); 3 (iii); 4 (iv)
b. 1 (iv); 2 (i); 3 (ii); 4(iii)
c. 1 (ii); 2 (iii); 3 (iv); 4 (i)
d. 1 (iii); 2 (iv); 3 (i); 4(ii)
a. 1 (i); 2 (ii); 3 (iii); 4 (iv)

Question 69.
Given are Country-X’s exports (in Rs. crores) to different regions between April 2012 and February 2013:

| Region | Europe | Asia | America | Africa |
| --- | --- | --- | --- | --- |
| Exports | 31,516 | 42,516 | 23,495 | 5,133 |

Which of the following regions has 18° in the pie chart?
a. Europe
b. Asia
c. America
d. Africa
d. Africa
Hint
Regions in degrees of the pie chart will be
Europe = 31516/102660 × 360° = 110.52°
Asia = 42516/102660 × 360° = 149.09°
America = 23495/102660 × 360° = 82.39°
Africa = 5133/102660 × 360° = 18°

Question 70.
One of the following is a secondary source of data –
a. Collection of demographic data from your neighborhood
b. Data collected by an investigator from the shops selling coffee seeds
c. Output data related to the production of wheat from the World Bank Reports
d. Counting the number of persons visiting a shrine on a particular day.
c. Output data related to the production of wheat from the World Bank Reports
Hint
A secondary source of data is

• Already published data eg publications of government department and agencies
• Previous research
• Official statistics
• Mass media products
• Diaries
• Letters
• Government reports
• Web information
• Historical data and information
• Publications of international bodies like UNO, WTO

Question 71.
An ‘ogive’ can be used to estimate the value of –
a. Mean
b. Mode
c. Quartiles
d. Harmonic mean.
c. Quartiles
Hint
Cumulative histograms, also known as ogives, are graphs that can be used to determine how many data values lie above or below a particular value in a data set. They are used to determine median, quartiles, percentiles, etc.

Question 72.
Secondary data is collected by-
a. Government
b. Public
c. Online
d. All of the above
c. Online
Hint
Secondary data is any information collected by someone other than its user. It is data that has already been collected and is readily available for use. It is used to gain initial insight into the research problem. It is classified in terms of its source – either internal or external.

Question 73.
Which is the correct sequence
(i) Collection
(ii) Presentation
(iii) Organisation
(iv) Interpretation
(v) Analysis
a. (i), (ii), (iii), (v), (iv)
b. (i), (ii), (iii), (iv), (v)
c. (i), (iii), (ii), (v), (iv)
d. (ii), (i), (iii), (v), (iv)
c. (i), (iii), (ii), (v), (iv)
Hint
The correct sequence is
Collection
Organization
Presentation
Analysis
Interpretation

Question 74.
By finding the mid-value of upper widths of adjacent rectangles of the histogram we can make: –
a. Line Graph
b. Histograph
c. Ogive
d. Pie – chart
c. Ogive

Question 75.
In which of the following method of data collection more time is required?
a. Census & Sample both
b. Data collection through secondary sources
c. Census investigation
d. Sample investigation
b. Data collection through secondary sources
Hint
Secondary data is any information collected by someone other than its user.

Question 76.
The total angle contained in a pie chart is:
a. 360°
b. 180°
c. 90°
d. 120°
a. 360°
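
Because the whole circle is 360°, each share of the total is converted into an angle by multiplying it by 360. The sketch below applies this to the export figures given in Question 69; the function name is only illustrative.

```python
def pie_angles(values):
    """Convert each value into its angle in degrees out of a 360-degree circle."""
    total = sum(values.values())
    return {label: value / total * 360 for label, value in values.items()}

# Export figures (in Rs. crores) from Question 69.
exports = {"Europe": 31516, "Asia": 42516, "America": 23495, "Africa": 5133}
angles = pie_angles(exports)

for region, angle in angles.items():
    print(f"{region}: {angle:.2f} degrees")
```

Africa comes out at 18°, since 5,133 is exactly one twentieth of the total of 102,660.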

Question 77.
If the middle point of a class interval is 60 and the lower limit is 45, then the upper limit would be:
a. 75
b. 55
c. 60
d. 70
a. 75
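
The mid-value of a class is the average of its two limits, so the upper limit is twice the mid-value minus the lower limit. A small sketch, with an illustrative function name:

```python
def upper_limit(mid_value, lower_limit):
    """Rearranged from mid_value = (lower_limit + upper_limit) / 2."""
    return 2 * mid_value - lower_limit

# Figures from Question 77.
upper = upper_limit(60, 45)
print(upper)
```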

Question 79.
Graph constructed on the basis of Cumulative frequencies arranged in ascending order is:
a. More than Ogive
b. Simple Ogive
c. Less than Ogive
d. Any Ogive
c. Less than Ogive
Hint
Cumulative frequency curve – Cumulative histograms, also known as ogives, are graphs that can be used to determine how many data values lie above or below a particular value in a data set. The cumulative frequency is calculated from a frequency table, by adding each frequency to the total of the frequencies of all data values before it in the data set.

Question 80.
Which of the methods of collecting primary data is more expensive?
a. Online Surveys
b. Observation Methods
c. Mailed Questionnaire
d. Telephonic Interview.