Question: the Z test with Fisher's z transformation

Posted: 2015-06-20 07:22

R Tutorials--Chi Square Test of Independence
CHI SQUARE TEST OF INDEPENDENCE
From the help page, the syntax of the chisq.test( ) function is...
chisq.test(x, y = NULL, correct = TRUE,
p = rep(1/length(x), length(x)), rescale.p = FALSE,
simulate.p.value = FALSE, B = 2000)
This function is used for both the goodness of fit test and the test of
independence, and which test it does depends upon what kind of data you feed
it. If "x" is a numerical vector or a one-dimensional table of numerical
values, a goodness of fit test will be done (or attempted), treating "x" as a
vector of observed frequencies.
If "x" is a 2-D table, array, or matrix, then
it is assumed to be a contingency table of frequencies, and a test of
independence will be done.
Ignore "y". (See the help page if you must know.) The "correct=T" option
applies the Yates continuity correction when "x" is a 2x2 table. Set this to
FALSE if the correction is not desired. For the goodness of fit test, set "p"
equal to the null hypothesized proportions or probabilities for each of the
categories represented in the vector "x". In the examples (see the previous
tutorial), I'll also illustrate how this vector can be set to the expected
frequencies. The default is equal frequencies or probabilities. The remaining
options can be ignored for the moment.
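Here is a minimal sketch of how the kind of input selects the test. The counts in "obs" and "tab" are made up for illustration and are not from any of the data sets in this tutorial.

```r
# Hypothetical observed frequencies in four categories
obs <- c(18, 22, 30, 30)

# A vector triggers a goodness of fit test; "p" holds the null proportions
gof <- chisq.test(obs, p = c(0.2, 0.2, 0.3, 0.3))
gof$method      # names the test that was actually run
gof$parameter   # df = number of categories - 1

# A 2-D matrix triggers a test of independence
tab <- matrix(c(20, 30, 50,
                25, 35, 40), nrow = 2, byrow = TRUE)
ind <- chisq.test(tab)
ind$method
ind$parameter   # df = (rows - 1) * (cols - 1)
```

The "method" element of the result is a quick way to confirm which of the two tests R decided to do.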
Textbook Problems
The following textbook-like problem uses data from Hand et al. (1994)...
Senie et al. (1981) investigated the relationship between age and
frequency of breast self-examination in a sample of women (Senie,
R. T., Rosen, P. P., Lesser, M. L., and Kinne, D. W. Breast self-
examinations and medical examination relating to breast cancer
stage. American Journal of Public Health, 71, 583-590.)
A summary of the results is presented in the following table:
                    Frequency of breast self-examination
                    ------------------------------------
Age                 Monthly    Occasionally    Never
-------------------------------------------------
under 45               91           90           51
45 - 59               150          200          155
60 and over           109          198          172
------------------------------------------------
From Hand et al., page 307, table 368.
The data have already been tabled for us in most textbook problems. We just have
to get the data into an R data object. There are several ways to do this...
> row1 = c(91,90,51)                   # or col1 = c(91,150,109)
> row2 = c(150,200,155)                # and col2 = c(90,200,198)
> row3 = c(109,198,172)                # and col3 = c(51,155,172)
> data.table = rbind(row1,row2,row3)   # and data.table = cbind(col1,col2,col3)
> data.table
     [,1] [,2] [,3]
row1   91   90   51
row2  150  200  155
row3  109  198  172
> chisq.test(data.table)

        Pearson's Chi-squared test

data:  data.table
X-squared = 25.086, df = 4, p-value = 4.835e-05
If the data are available in an electronic document, like this one, it can be
entered into R using the scan( ) function...
> ls()
[1] "data.table" "row1"       "row2"       "row3"
> rm(list=ls())                        # Clean up a bit first!
> the.data = scan()
1: 91 90 51
4: 150 200 155
7: 109 198 172
10:
Read 9 items
> the.data
[1]  91  90  51 150 200 155 109 198 172
> data.matrix = matrix(the.data, byrow=T, nrow=3)
> data.matrix
     [,1] [,2] [,3]
[1,]   91   90   51
[2,]  150  200  155
[3,]  109  198  172
> chisq.test(data.matrix)$statistic    # keeping the output brief
The data table can be copied and pasted one row at a time into the scan( )
function. The result will be a vector in which the data are entered row-wise.
A matrix is then created using the matrix( ) function, which expects the
data vector to contain column-wise data, i.e., 91, 150, 109, 90, ..., 172. This
behavior can be changed with the "byrow=T" option. The "nrow=" option specifies
how many rows to create in the matrix.
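A quick sketch of the difference, using the same nine numbers. The names "by.col" and "by.row" are mine, just for the comparison.

```r
the.data <- c(91, 90, 51, 150, 200, 155, 109, 198, 172)   # row-wise, as scanned

by.col <- matrix(the.data, nrow = 3)                 # default: fills down the columns
by.row <- matrix(the.data, nrow = 3, byrow = TRUE)   # fills across the rows

by.col[1, ]   # 91 150 109, which is NOT the first row of the textbook table
by.row[1, ]   # 91 90 51, which is

# The two layouts are transposes of each other
identical(by.col, t(by.row))
```

For a test of independence the transposed table gives the same chi square value, but the row and column labels (and any barplot) would come out swapped, so it pays to get this right.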
If the problem asks for further data summarization, you will probably want
to give names to the rows and columns of the data table first, since this will
make R output easier to read and especially make barplots much easier to
construct...
> dimnames(data.matrix) = list("Age"=c("lt.45",45.to.59","ge.60"),
Error: unexpected symbol in "dimnames(data.matrix) = list("Age"=c("lt.45",45.to.59"
And here, in all fairness to those who would prefer a hunt-and-click interface,
I suppose I should interpose a comment. This error drove me nuts for a good ten
minutes or more. It even prompted me to go back and look at a previous tutorial
on data entry. To use a command line interface, you do have to have an eye for
minutiae. Do you see the mistake? It took me awhile to spot it! I left out an
open quote. The tip off should have been that R stopped outputting the error
message at the closed quote that had no matching open quote. Let's see if I can
get it right this time (arrrrggghhhh!)...
> dimnames(data.matrix) = list("Age"=c("lt.45","45.to.59","ge.60"),
"Freq"=c("Monthly","Occasionally","Never"))
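> data.matrix
          Freq
Age        Monthly Occasionally Never
  lt.45         91           90    51
  45.to.59     150          200   155
  ge.60        109          198   172
> ### At last!
> ### Examine marginal distributions...
> addmargins(data.matrix)
          Freq
Age        Monthly Occasionally Never  Sum
  lt.45         91           90    51  232
  45.to.59     150          200   155  505
  ge.60        109          198   172  479
  Sum          350          488   378 1216
> ### Examine conditional distributions...
> prop.table(data.matrix,1)
          Freq
Age          Monthly Occasionally     Never
  lt.45    0.3922414    0.3879310 0.2198276
  45.to.59 0.2970297    0.3960396 0.3069307
  ge.60    0.2275574    0.4133612 0.3590814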
> ### And finally a barplot...
> barplot(data.matrix, beside=T)       # output not shown
> ### Okay, a better barplot...
> barplot(t(data.matrix), beside=T, legend=T, ylim=c(0,250),
ylab="Observed frequencies in sample",
main="Frequency of Breast Self-Examination by Age")
In a future tutorial, we'll learn how to move the legend to a more appealing
location on the graph.
Data From a Table Object
You may remember the following data object, which was used in a couple
previous tutorials...
> UCBAdmissions                        # output not shown
> ftable(UCBAdmissions, row.vars=c("Admit"))
         Gender Male                     Female
         Dept      A   B   C   D   E   F      A   B   C   D   E   F
Admit
Admitted         512 353 120 138  53  22     89  17 202 131  94  24
Rejected         313 207 205 279 138 351     19   8 391 244 299 317
> round(prop.table(ftable(UCBAdmissions, row.vars=c("Admit")),2),2)
         Gender Male                          Female
         Dept      A    B    C    D    E    F      A    B    C    D    E    F
Admit
Admitted        0.62 0.63 0.37 0.33 0.28 0.06   0.82 0.68 0.34 0.35 0.24 0.07
Rejected        0.38 0.37 0.63 0.67 0.72 0.94   0.18 0.32 0.66 0.65 0.76 0.93
And don't think for a moment it didn't take me a little trial and error to get
the last version of the data table!
Suppose we want a chi square test of Admit by Gender in Dept. B only. I'll
begin by extracting just those data (although that is technically
unnecessary)...
> deptB = UCBAdmissions[,,2]           # all rows, all cols, but only of layer 2
> deptB
          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8
> chisq.test(deptB)

        Pearson's Chi-squared test with Yates' continuity correction

data:  deptB
X-squared = 0.0851, df = 1, p-value = 0.7705
> chisq.test(deptB)$expected
          Gender
Admit          Male    Female
  Admitted 354.1880 15.811966
  Rejected 205.8120  9.188034
No significant relationship is found between Admit and Gender in Dept. B, as
might have been expected from examination of the data table. I also took a look
at the expected frequencies, just because I could, although it's not so crucial
to do so when the null hypothesis is retained. Notice that Yates' correction for
continuity is applied by default when a 2x2 table is entered into the chi square
procedure. This behavior can be turned off with the option "correct=F".
> chisq.test(deptB, correct=F)

        Pearson's Chi-squared test

data:  deptB
X-squared = 0.2537, df = 1, p-value = 0.6145
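To see what the correction actually does, here is a sketch that computes both versions of the statistic by hand for Dept. B. The names "expected", "x2.plain", and "x2.yates" are my own. The correction subtracts 0.5 from each absolute difference before squaring (R uses a smaller amount if any difference is itself under 0.5, which is not the case here).

```r
deptB <- UCBAdmissions[, , 2]          # built-in data set, Dept. B layer

# Expected frequencies under independence: (row total * column total) / n
expected <- outer(rowSums(deptB), colSums(deptB)) / sum(deptB)

# Pearson statistic without and with the continuity correction
x2.plain <- sum((deptB - expected)^2 / expected)
x2.yates <- sum((abs(deptB - expected) - 0.5)^2 / expected)

round(c(plain = x2.plain, yates = x2.yates), 4)

# Both should agree with what chisq.test() reports
all.equal(x2.plain, unname(chisq.test(deptB, correct = FALSE)$statistic))
all.equal(x2.yates, unname(chisq.test(deptB)$statistic))
```

The corrected statistic is always the smaller of the two, so the correction makes the test more conservative.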
Data From a Data Frame
The data frame "survey" in the "MASS" package contains responses from 237
statistics students to a series of questions, some of which resulted in
categorical answers. For one item, the students were asked to fold their arms,
and the recorded result was which arm was on top: right, left, or neither. Let's
see how this variable relates to the sex of the student...
> data(survey, package="MASS")
> attach(survey)
> table(Sex, Fold)
        Fold
Sex      L on R Neither R on L
  Female     48       6     64
  Male       50      12     56
In this case, the two vectors of interest, "Sex" and "Fold", both contain
categorical values on a case by case basis. You could table them first (as we
just did), store this table in a data object, and enter the name of this
data object into the chisq.test( ) function, but in cases like these, R
allows a simpler syntax...
> chisq.test(Sex, Fold)

        Pearson's Chi-squared test

data:  Sex and Fold
X-squared = 2.5741, df = 2, p-value = 0.2761
As long as the two data vectors, X and Y, are categorical, and they are
arranged such that X[i] corresponds on a case by case basis to
Y[i], you need only to enter the names of the two data vectors
into the chi square test function. R will do the implied crosstabulation for
you.
> detach(survey)                       # Don't forget!
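A small sketch showing that the two routes give the same answer. The vectors "x" and "y" are hypothetical case-by-case data invented for the comparison, and the warnings about small expected frequencies are suppressed because only the equivalence matters here.

```r
# Hypothetical case-by-case data: one entry per subject in each vector
x <- factor(c("a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a", "a"))
y <- factor(c("u", "v", "u", "v", "u", "u", "v", "v", "u", "v", "u", "v"))

from.vectors <- suppressWarnings(chisq.test(x, y))
from.table   <- suppressWarnings(chisq.test(table(x, y)))

# Same statistic and p-value either way
all.equal(unname(from.vectors$statistic), unname(from.table$statistic))
all.equal(from.vectors$p.value, from.table$p.value)
```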
Alternatives When the EFs Are Small
The following data are from a Stanford University study of the effectiveness
of the antidepressant Celexa in the treatment of compulsive shopping. These data
were found in Verzani (2005, p.262f), and the analysis here is similar to his...
                  outcome
             -----------------
treatment    worse same better
---------------------------
Celexa         2     3     7
placebo        2     8     2
---------------------------
> freqs = c(2,2,3,8,7,2)
> data.matrix = matrix(freqs, nrow=2)
> dimnames(data.matrix) = list("treatment"=c("Celexa","placebo"),
"outcome"=c("worse","same","better"))
> data.matrix
         outcome
treatment worse same better
  Celexa      2    3      7
  placebo     2    8      2
> chisq.test(data.matrix)

        Pearson's Chi-squared test

data:  data.matrix
X-squared = 5.0505, df = 2, p-value = 0.08004

Warning message:
In chisq.test(data.matrix) : Chi-squared approximation may be incorrect
> chisq.test(data.matrix)$expected
         outcome
treatment worse same better
  Celexa      2  5.5    4.5
  placebo     2  5.5    4.5
Warning message:
In chisq.test(data.matrix) : Chi-squared approximation may be incorrect
The chi square procedure produces a warning message (somewhat cryptic) when any
of the expected frequencies fall below 5. In this case, most of them do, which
should lead us to be concerned about the accuracy of any p-value calculated
from the chi-squared distribution. Since the p-value is close to .05, we might
want to try a more accurate method. What can we do?
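If you want to check the expected frequencies without triggering the warning, they are easy to compute directly. This is only a sketch; "expected" is my own name for the result.

```r
freqs <- c(2, 2, 3, 8, 7, 2)
data.matrix <- matrix(freqs, nrow = 2,
                      dimnames = list(treatment = c("Celexa", "placebo"),
                                      outcome = c("worse", "same", "better")))

# Expected frequency for each cell: (row total * column total) / grand total
expected <- outer(rowSums(data.matrix), colSums(data.matrix)) / sum(data.matrix)
expected

# How many cells fall below the usual threshold of 5?
sum(expected < 5)
```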
There are two choices: a Fisher Exact Test, and a p-value calculated by
Monte Carlo simulation. Let's look at the Fisher Exact Test first.
You may be thinking, "Um, doesn't the Fisher Exact Test work only for 2x2
tables?" Yes, and when a 2x2 table is supplied, R will give you the exact
p-value calculated from the hypergeometric distribution. However, in R, the
Fisher Exact Test has been extended to work with larger tables, provided the
obtained frequencies are not too large. You should see the help page for the
details of how this has been done...
> fisher.test(data.matrix)

        Fisher's Exact Test for Count Data

data:  data.matrix
p-value = 0.07303
alternative hypothesis: two.sided
The Fisher Exact Test suggests that the chi square procedure gave us a fairly
accurate result in spite of the low expected frequencies. And don't get your
feathers in a ruffle here, because although there is an "alternative=" option
for this test, it works ONLY for 2x2 tables, in which case R calculates and
tests the odds ratio for significance. You might want to investigate the Fisher
Exact Test further on your own, as there is additional output when a 2x2 table
is supplied.
The other alternative is to simulate the sampling distribution of the test
statistic (in this case, chi squared) using Monte Carlo methods. (To find out
more about this, read the
tutorial.) Fortunately, the chisq.test( ) function incorporates an option
that automates this complex procedure for us...
> chisq.test(data.matrix, simulate.p.value=T, B=999)

        Pearson's Chi-squared test with simulated p-value (based on 999
        replicates)

data:  data.matrix
X-squared = 5.0505, df = NA, p-value = 0.111
The "simulate.p.value=T" option (default value is FALSE) does the Monte Carlo
simulation using "B=999" (default value is B=2000) replicates. It doesn't look
like we can buy a significant relationship in this tutorial!
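For the curious, here is a rough sketch of what such a simulation involves, written by hand with the r2dtable( ) function, which draws random tables having the same row and column totals as the observed one. The seed and the variable names are my own choices, and the simulated p-value will vary a little from run to run and from the value shown above.

```r
set.seed(1)   # arbitrary seed, only to make the sketch reproducible
data.matrix <- matrix(c(2, 2, 3, 8, 7, 2), nrow = 2)
expected <- outer(rowSums(data.matrix), colSums(data.matrix)) / sum(data.matrix)
observed.x2 <- sum((data.matrix - expected)^2 / expected)

# Draw B random tables with the same margins and compute chi squared for each
B <- 999
sims <- r2dtable(B, rowSums(data.matrix), colSums(data.matrix))
sim.x2 <- sapply(sims, function(tab) sum((tab - expected)^2 / expected))

# Simulated p-value: proportion of tables at least as extreme as the observed
# one, counting the observed table itself (hence the +1 terms). The tiny
# tolerance keeps exact ties from being lost to floating-point rounding.
tol <- 1 - 64 * .Machine$double.eps
p.sim <- (1 + sum(sim.x2 >= observed.x2 * tol)) / (B + 1)
p.sim
```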
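I used Fisher's exact test to compute a p-value and a confidence interval. After trying two methods, I found that the p-values are the same but the confidence intervals differ. Could someone explain?
Method 1: In dataset 1, the "Unique Subject Identifier" column is the patient ID, "Planned Treatment for Period 01 (N)" is the drug code (there are only two drugs), and "flag" indicates whether an AE (adverse event) occurred (0 = no, 1 = yes).
proc freq data=test1;
  tables TRTPN*flag / chisq cl alpha=0.05;
  ods output FishersExact=pm1 RelativeRisks=pm2;
run;
(test1 is dataset 1 in the figure.)
Method 2: Count the subjects in dataset 1 with and without an AE, which gives dataset 2 in the figure.
The program is below. "Planned Treatment for Period 01 (N)" is the drug code (there are only two drugs), "flag" indicates whether an AE (adverse event) occurred (0 = no, 1 = yes), and "n" is the number of subjects.
proc freq data=test2;
  tables flag*TRTPN / chisq cl alpha=0.05;
  weight n;
  ods output FishersExact=pm1 RelativeRisks=pm2;
run;
(test2 is dataset 2 in the figure.)
Why do the confidence intervals differ while the p-values are the same?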
jingju11: The condition for the relative risks to be equal is roughly that the products of the diagonals are equal. Jingju
linggol: Take a look: can flag*TRTPN and TRTPN*flag really be the same?
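bhzhangkelei, replying to linggol ("Take a look: can flag*TRTPN and TRTPN*flag really be the same?"):
That really is not the problem. I tried both, and the results did not change.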
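bhzhangkelei, replying to jingju11 ("The condition for the relative risks to be equal is roughly that the products of the diagonals are equal."):
What I mean is that the data I used are the same and only the processing differs, so why would the confidence intervals be different?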
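jingju11, replying to bhzhangkelei ("What I mean is that the data I used are the same and only the processing differs, so why would the confidence intervals be different?"):
Fisher's exact test has no directionality, so swapping rows and columns leaves its result unchanged. The relative risk (RR), however, generally changes, except in some special cases. Jingju
RR = (a/(a+b))/(c/(c+d)).
Swapping rows and columns gives RR = (a/(a+c))/(b/(b+d)).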
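The point about direction can be checked numerically. Below is a minimal sketch in R rather than SAS, with made-up counts; the table "tab" and the helper "rel.risk" are hypothetical and are not the poster's data or any SAS output.

```r
# Hypothetical counts: rows = treatment (1, 2), columns = flag (0 = no AE, 1 = AE)
tab <- matrix(c(30, 20, 10, 25), nrow = 2,
              dimnames = list(trt = c("1", "2"), flag = c("0", "1")))

# Relative risk of falling in column 1, row 1 versus row 2:
# (a/(a+b)) / (c/(c+d))
rel.risk <- function(m) (m[1, 1] / sum(m[1, ])) / (m[2, 1] / sum(m[2, ]))

# Fisher's exact test gives the same p-value for the table and its transpose
all.equal(fisher.test(tab)$p.value, fisher.test(t(tab))$p.value)

# The relative risk does not
rel.risk(tab)      # (30/40) / (20/45)
rel.risk(t(tab))   # (30/50) / (10/35)
```

Since the relative risk itself depends on which variable defines the rows, any confidence interval built around it will depend on that choice too, while the Fisher p-value will not.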
Just to confirm: you mean the confidence limits for p values from Fisher's exact? I don't think we have this statistic. JingJu
Everything you need to do real statistical analysis using Excel
Fisher’s Exact Test
When the conditions for Pearson’s chi-square test are not met, especially when one or more of the cells have exp_i < 5, an alternative approach with 2 × 2 contingency tables is to use Fisher’s exact test. Since this method is more computationally intense, it is best used for smaller samples.
Example 1: Repeat Example 2 from
using the data in range A5:D8 of Figure 1; i.e. determine whether the cure rate is independent of the therapy used.
Figure 1 – Data and Chi-square test for Example 1
As you can see from Figure 1, the expectation for two of the cells is less than 5. Since we are dealing with a 2 × 2 contingency table with relatively small sample size, it is better to use Fisher’s exact test.
The approach is to determine how many different ways the above marginal frequencies can be achieved and then determine the probability that the above observed cell configuration can be obtained merely by chance.
We can restrict our attention to any one of the cells since once the frequency for one cell is determined the frequencies for the other cells can be determined from the marginal totals. We choose cell B6 since it has the smallest marginal total (namely 9 in cell D6) and it is smaller than the other element that makes up this marginal total (namely 7 in cell C6).
Now cell B6 can take any value between 0 and 9; once this value is set the values of the other three cells can be adjusted to maintain the marginal totals.
The probability that cell B6 takes on a specific value x is equivalent to the probability of getting x successes in a sample of size 9 (cell D6) taken without replacement from a population of size 21 (cell D8) which contains 11 (cell B8) successful choices. This can be calculated using the hypergeometric distribution. Here cells D6 and B8 are cells with the marginal totals corresponding to cell B6 and cell D8 contains the grand total.
Figure 2 contains a table of the probabilities for each possible value of x.
Figure 2 – Fisher exact test for Example 1
Thus, e.g., cell L11 contains the formula
=HYPGEOMDIST(K11,$B$8,$D$6,$D$8)
Our test consists of determining whether the probability that at most 2 of those taking therapy 1 are cured (the observed count in cell B6) is less than .05. From Figure 2, we see that the probability of count 0 is 3.4E-05, the probability of count 1 is .001684 and the probability of count 2 is .022454 for a cumulative probability of .024172 < .05 = α, and so we reject the null hypothesis and conclude there is a significant difference between the cure rates for the two therapies.
There are one-tail and two-tail versions of the test. The p-value for the one-tail test (cell L17) is given by the formula =SUM(L6:L8) or equivalently (for versions of Excel starting with Excel 2010)
= HYPGEOM.DIST(K8,B8,D6,D8,TRUE)
The p-value for the two-tail test (cell L18) is given by the formula
=SUM(L6:L8)+SUM(L14:L15)
where K14 is the leftmost cell in the right tail that has a pdf value ≤ L8 (since .005614 ≤ .022454, but .050522 > .022454). Equivalently, we can use the formula (for versions of Excel starting with Excel 2010)
= HYPGEOM.DIST(K8,B8,D6,D8,TRUE)+1-HYPGEOM.DIST(K13,B8,D6,D8,TRUE)
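The same calculation can be sketched outside Excel. The R code below is an illustration only and is not part of the Real Statistics Resource Pack; it rebuilds the probabilities of Figure 2 from the marginal totals given in the text (11 cured, 9 on therapy 1, 21 in total), and the variable names are chosen here for convenience.

```r
# Margins from Example 1: 11 cured overall (B8), 9 on therapy 1 (D6), 21 in total (D8)
x <- 0:9
probs <- dhyper(x, m = 11, n = 21 - 11, k = 9)   # P(cell B6 = x)

observed <- 2                                    # observed count in cell B6
one.tail <- sum(probs[x <= observed])

# Two-tail: add every outcome that is no more probable than the observed one
# (the small relative tolerance guards against floating-point ties)
p.obs <- probs[x == observed]
two.tail <- sum(probs[probs <= p.obs * (1 + 1e-7)])

round(c(one.tail = one.tail, two.tail = two.tail), 6)
```

Rounded to six places, these should reproduce the values .024172 and .029973 quoted for FISHERTEST below.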
Real Statistics Excel Function: The following function is provided in the Real Statistics Resource Pack:
FISHERTEST(R1, tails) = the probability calculated by the Fisher exact test for the
2 × 2 contingency table contained in range R1 where tails = the number of tails = 1 or 2 (default).
The range R1 must contain only numeric values.
For Example 1, FISHERTEST(B6:C7,1) = .024172 and FISHERTEST(B6:C7, 2) = .029973.
Charles Zaiontz