
Sunday, January 4, 2015

Quantifying a leader



The look of a leader

Getting to the top has as much to do with how you look as with what you achieve
The Economist



In GORILLA society, power belongs to silverback males. These splendid creatures have numerous status markers besides their back hair: they are bigger than the rest of their band, they strike space-filling postures, they produce deeper sounds, they thump their chests vigorously and, in general, they exude an air of physical fitness. Things are not so different in the corporate world. The typical chief executive is more than six feet tall, has a deep voice, a good posture, a touch of grey in his thick, glossy hair and, for his age, a fit body. Bosses spread themselves out behind their large desks. They stand tall when talking to subordinates. Their conversation is loaded with prestige pauses and declarative statements.

The big difference between gorillas and humans is, of course, that human society changes rapidly. The past few decades have seen a remarkable shift in the distribution of power: between men and women, the West and the emerging world, and geeks and non-geeks. Women run some of America's largest companies, such as General Motors (Mary Barra) and IBM (Virginia Rometty). More than half of the world's 2,500 largest public companies have their headquarters outside the West. Geeks barely out of short trousers run some of the world's most dynamic businesses. Peter Thiel, one of Silicon Valley's leading investors, has introduced a blanket rule: never invest in a CEO who wears a suit.


Yet it is remarkable, in this supposed age of diversity, how many bosses still fit the stereotype. First of all, they are tall: in research for his 2005 book, "Blink", Malcolm Gladwell found that 30% of the CEOs of Fortune 500 companies are 6 feet 2 inches or taller, compared with 3.9% of the American population.

People who "sound right" also have a marked advantage in the race to the top. Quantified Communications, a Texas-based company, asked people to evaluate speeches delivered by 120 executives. It found that voice quality accounted for 23% of listeners' evaluations, while the content of the speech accounted for only 11%. Academics from the business schools of the University of California, San Diego and Duke University listened to 792 male CEOs giving presentations to investors and found that those with the deepest voices earned $187,000 a year more than the average.

Physical fitness seems to matter too: a study published this month by Peter Limbach of the Karlsruhe Institute of Technology and Florian Sonnenburg of the University of Cologne found that companies in America's S&P 1500 index whose chief executives had finished a marathon were worth 5% more on average than those whose bosses had not.

Good posture makes people act like leaders as well as look like them: Amy Cuddy of Harvard Business School notes that the very act of standing tall, with feet planted firmly and slightly apart, chest out and shoulders back, boosts the supply of testosterone in the blood and lowers the supply of cortisol, a steroid associated with stress. (Unfortunately, it also increases the likelihood of making a risky bet.)

As well as relying on all these supposedly positive indicators of fitness to lead, those who choose bosses also fall back on some negative stereotypes. Overweight people, especially women, are judged incapable of controlling themselves, let alone others. Those who "uptalk", habitually ending their statements on a high note as if asking a question, rule themselves out on the grounds that they sound tentative and juvenile.

The rise of multinational giants from emerging markets has not yet made much difference to all this stereotyping. The bosses of such companies often suffer from the corporate equivalent of a colonial cringe. They wear Western business suits. They litter their conversations with Western management jargon. And they pack their children off to Harvard Business School, where they will learn to look and sound like Western-style managers. High-tech companies cheerfully abandon Mr Thiel's rule once they reach a certain size and recruit a besuited outsider as CEO. Women leaders have reacted in different ways. Some have defined themselves by wearing power suits and working long hours. Others have celebrated motherhood: in her book "Lean In", Sheryl Sandberg, Facebook's chief operating officer, writes about de-lousing her children aboard a company plane.

Posing for power

Can anything be done about this bias towards promoting people of a certain type? Ideally, those selecting a new boss would conscientiously set aside all stereotypes and judge candidates solely on their merits. But given the large number of candidates, all with perfect CVs, selection committees keep looking for the "X" factor and find, strangely enough, that it resides in people who look remarkably like themselves. Another solution is to introduce quotas for chief executives and board members. But the risk is that this ends in tokenism rather than a genuine equalisation of opportunities. So some management experts suggest we simply accept that stereotypes and prejudices cannot be set aside, and instead help those born outside the magic genetic circle to project a sense of power and self-confidence.

Ms Cuddy gave a talk on "power poses" at the 2012 TED Global conference that has since become TED's second-most-downloaded talk. In her recent book, "Executive Presence", Sylvia Ann Hewlett of the Center for Talent Innovation in New York urges young women to lower the register of their voices, as Margaret Thatcher did, to eliminate uptalking and other vocal tics, and to look people in the eye when giving presentations. She advises all aspiring executives to work out regularly and to look the best they possibly can. This may sound like a bit of a cop-out. But the evidence is strong that candidates for top jobs can still be undermined by superficial things such as posture and tone of voice. More than a century ago, Oscar Wilde joked: "It is only shallow people who do not judge by appearances." Unfortunately, those who choose leaders still seem to think this way.

Wednesday, August 28, 2013

A/B tests without the standard deviation, and how to work around it

A/B Testing Duration Data


Let's say you make a change to your website and want to test whether people tend to stay on the site longer after the change.
You might think: that's easy! I'll just compare the average visit lengths before and after the change and then I'll have my answer.
Readers of this blog are, of course, savvier than that; they know they should perform a proper statistical test to determine if a reported difference could be due to chance.
But there's a problem. When comparing two continuous quantities (such as visit durations), the usual statistical test is the two sample t-test. A t-test requires three key pieces of information from each test group: the number of subjects, the sample mean, and the standard deviation.
Unfortunately, many reporting tools only report the mean and count; the standard deviation is apparently an ugly duckling that no one wants to talk about. For example, here's a screenshot from a Google Analytics dashboard:
[Screenshot: a Google Analytics report listing visits and average visit duration by country, with no standard deviation column]
It's a shame that the standard deviation has been left out here, because it renders a proper t-test impossible.
But that shouldn't stop you from applying some math to the problem. With a simple and fairly reasonable assumption, you can arrive at an answer to report to the Big Boss.
Here's the assumption: the probability of a visitor leaving the site at any given moment is constant.
It's not a perfect assumption, as perhaps you have advertising copy with a few extremely engaging passages that no one would ever leave while reading. But it's not a bad place to start.
That one assumption — sometimes called memorylessness — implies that the lengths of visits will follow an exponential distribution.
The nifty part about the exponential distribution is that its variance is always equal to the square of its mean. That means to perform a two-sample t-test, you just take the standard deviation to be the same as the mean. So if one group stays on the site an average of 62 seconds, you can take 62 seconds to be the standard deviation as well.
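As a rough illustration of that shortcut, here is a minimal sketch of a two-sample Welch t-test in Python that substitutes each group's mean for its missing standard deviation. It assumes SciPy is available, and the visitor counts and means are hypothetical, not taken from any real dashboard:

    # Welch t-test for visit durations where the sd is unavailable and is
    # assumed equal to the mean (the exponential assumption).
    import math
    from scipy import stats

    def duration_t_test(n1, mean1, n2, mean2):
        """Two-sided Welch t-test taking sd == mean for each group."""
        sd1, sd2 = mean1, mean2                # exponential: variance = mean**2
        se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
        t = (mean1 - mean2) / se
        # Welch-Satterthwaite approximation for the degrees of freedom
        df = (sd1**2 / n1 + sd2**2 / n2) ** 2 / (
            (sd1**2 / n1) ** 2 / (n1 - 1) + (sd2**2 / n2) ** 2 / (n2 - 1)
        )
        p = 2 * stats.t.sf(abs(t), df)
        return t, p

    # e.g. 1,500 US visitors averaging 62s vs. 1,200 UK visitors averaging 54s
    t, p = duration_t_test(1500, 62.0, 1200, 54.0)
    print(f"t = {t:.3f}, p = {p:.4f}")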
As an example, I've plugged in the numbers from that Google Analytics page here. (It quickly becomes clear that Americans spend more time on the example site than their British counterparts.)
Because the Big Boss might not understand the finer points of exponential distributions, I've also created a dedicated Survival Times Tool. Type in the number of visitors in each group and the average length of visit, and it will tell you whether either group is sticking around longer in a statistically significant way. The tool also constructs confidence intervals around the mean for your viewing pleasure.
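For the confidence intervals, the same exponential assumption gives an exact recipe: if visit lengths are exponential, then 2·n·x̄ divided by the true mean follows a chi-squared distribution with 2n degrees of freedom. A short sketch of this calculation, again with hypothetical numbers rather than the actual tool's internals:

    # Exact CI for the mean of exponential durations: 2*n*xbar/theta ~ chi2(2n)
    from scipy import stats

    def exponential_mean_ci(n, xbar, alpha=0.05):
        total = 2 * n * xbar
        lower = total / stats.chi2.ppf(1 - alpha / 2, 2 * n)
        upper = total / stats.chi2.ppf(alpha / 2, 2 * n)
        return lower, upper

    print(exponential_mean_ci(1500, 62.0))     # roughly (59.0, 65.3)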
Of course, this tool shouldn't stop you from telling your in-house programmers to please report the standard deviation whenever they report an average. An assumption can go a long way, but it's never a good substitute for data.


EvanMiller.org

Tuesday, August 27, 2013

How not to run an A/B test...

How Not To Run An A/B Test





If you run A/B tests on your website and regularly check ongoing experiments for significant results, you might be falling prey to what statisticians call repeated significance testing errors. As a result, even though your dashboard says a result is statistically significant, there’s a good chance that it’s actually insignificant. This note explains why.

Background

When an A/B testing dashboard says there is a “95% chance of beating original” or “90% probability of statistical significance,” it’s asking the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like we do in the data just by chance? The answer to that question is called the significance level, and “statistically significant results” mean that the significance level is low, e.g. 5% or 1%. Dashboards usually take the complement of this (e.g. 95% or 99%) and report it as a “chance of beating the original” or something like that.
However, the significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless. This result is completely counterintuitive and all the A/B testing packages out there ignore it, but I’ll try to explain the source of the problem with a simple example.

Example

Suppose you analyze an experiment after 200 and 500 observations. There are four things that could happen:
                         Scenario 1      Scenario 2      Scenario 3      Scenario 4
After 200 observations   Insignificant   Insignificant   Significant!    Significant!
After 500 observations   Insignificant   Significant!    Insignificant   Significant!
End of experiment        Insignificant   Significant!    Insignificant   Significant!
Assuming treatments A and B are the same and the significance level is 5%, then at the end of the experiment, we’ll have a significant result 5% of the time.
But suppose we stop the experiment as soon as there is a significant result. Now look at the four things that could happen:
                         Scenario 1      Scenario 2      Scenario 3      Scenario 4
After 200 observations   Insignificant   Insignificant   Significant!    Significant!
After 500 observations   Insignificant   Significant!    (trial stopped) (trial stopped)
End of experiment        Insignificant   Significant!    Significant!    Significant!
The first row is the same as before, and the reported significance levels after 200 observations are perfectly fine. But now look at the third row. At the end of the experiment, assuming A and B are actually the same, we’ve increased the ratio of significant relative to insignificant results. Therefore, the reported significance level – the “percent of the time the observed difference is due to chance” – will be wrong.

How big of a problem is this?

Suppose your conversion rate is 50% and you want to test to see if a new logo gives you a conversion rate of more than 50% (or less). You stop the experiment as soon as there is 5% significance, or you call off the experiment after 150 observations. Now suppose your new logo actually does nothing. What percent of the time will your experiment wrongly find a significant result? No more than five percent, right? Maybe six percent, in light of the preceding analysis?
Try 26.1% – more than five times what you probably thought the significance level was. This is sort of a worst-case scenario, since we’re running a significance test after every observation, but it’s not unheard-of. At least one A/B testing framework out there actually provides code for automatically stopping experiments after there is a significant result. That sounds like a neat trick until you realize it’s a statistical abomination.
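If you want to convince yourself (or the Big Boss) of numbers like that, a quick Monte Carlo check is easy to write. The sketch below simulates the scenario just described, an A/A setup with a 50% conversion rate, applying a two-sided z-test against p = 0.5 after every observation up to 150. The exact false-positive rate depends on which test is run at each peek, but it lands in the same alarming neighborhood:

    # Monte Carlo estimate of the false-positive rate when a 5% test is run
    # after every observation up to n = 150, with no true effect present.
    import math
    import random

    def peeking_false_positive_rate(n_max=150, trials=20_000, seed=1):
        rng = random.Random(seed)
        z_crit = 1.959964                      # two-sided 5% critical value
        false_positives = 0
        for _ in range(trials):
            successes = 0
            for n in range(1, n_max + 1):
                successes += rng.random() < 0.5
                z = (successes - n / 2) / math.sqrt(n / 4)
                if abs(z) > z_crit:            # "significant" -- stop and declare
                    false_positives += 1
                    break
        return false_positives / trials

    print(peeking_false_positive_rate())       # roughly 0.25, vs. the nominal 0.05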
Repeated significance testing always increases the rate of false positives, that is, you’ll think many insignificant results are significant (but not the other way around). The problem will be present if you ever find yourself “peeking” at the data and stopping an experiment that seems to be giving a significant result. The more you peek, the more your significance levels will be off. For example, if you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5% significance. Here are other reported significance values you need to see just to get an actual significance of 5%:
You peeked...   To get 5% actual significance you need...
1 time          2.9% reported significance
2 times         2.2% reported significance
3 times         1.8% reported significance
5 times         1.4% reported significance
10 times        1.0% reported significance
Decide for yourself how big a problem you have, but if you run your business by constantly checking the results of ongoing A/B tests and making quick decisions, then this table should give you goosebumps.

What can be done?

If you run experiments: the best way to avoid repeated significance testing errors is to not test significance repeatedly. Decide on a sample size in advance and wait until the experiment is over before you start believing the “chance of beating original” figures that the A/B testing software gives you. “Peeking” at the data is OK as long as you can restrain yourself from stopping an experiment before it has run its course. I know this goes against something in human nature, so perhaps the best advice is: no peeking!
Since you are going to fix the sample size in advance, what sample size should you use? This formula is a good rule of thumb:
n = 16σ² / δ²

where δ is the minimum effect you wish to detect and σ² is the sample variance you expect. Of course you might not know the variance, but if it's just a binomial proportion you're calculating (e.g. a percent conversion rate), the variance is given by:

σ² = p × (1 − p)
Committing to a sample size completely mitigates the problem described here.
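The rule of thumb translates directly into code. A tiny sketch, with illustrative numbers (a 50% baseline conversion rate and a 5-percentage-point minimum detectable effect):

    # n = 16 * sigma^2 / delta^2, with the binomial variance sigma^2 = p(1-p).
    # The constant 16 corresponds roughly to a two-sided 5% test at 80% power,
    # read as a per-group sample size for a two-group comparison.
    def rule_of_thumb_sample_size(p_baseline, delta):
        variance = p_baseline * (1 - p_baseline)
        return 16 * variance / delta ** 2

    print(rule_of_thumb_sample_size(0.50, 0.05))   # 1600.0 observations per group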
UPDATE, May 2013: You can see this formula in action with my new interactive Sample Size Calculator. Enter the effect size you wish to detect, set the power and significance levels, and you'll get an easy-to-read number telling you the sample size you need. END OF UPDATE

If you write A/B testing software: Don't report significance levels until an experiment is over, and stop using significance levels to decide whether an experiment should stop or continue. Instead of reporting the significance of ongoing experiments, report how large an effect can be detected given the current sample size. That can be calculated with:

δ = (t_{α/2} + t_β) × √(2σ² / n)

where the two t's are the t-statistics for a given significance level α/2 and power (1 − β).
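In code, a sketch of that calculation might look like this (equal group sizes assumed; SciPy supplies the t quantiles):

    # Smallest effect detectable with per-group sample size n, variance sigma^2,
    # two-sided significance alpha, and the given power.
    import math
    from scipy import stats

    def minimum_detectable_effect(n, variance, alpha=0.05, power=0.80):
        df = 2 * (n - 1)                       # two-sample t-test, equal groups
        t_alpha = stats.t.ppf(1 - alpha / 2, df)
        t_beta = stats.t.ppf(power, df)
        return (t_alpha + t_beta) * math.sqrt(2 * variance / n)

    # e.g. a 50% baseline conversion rate, 1,000 observations per group so far
    print(minimum_detectable_effect(1000, 0.5 * 0.5))   # about 0.063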
Painful as it sounds, you may even consider excluding the “current estimate” of the treatment effect until the experiment is over. If that information is used to stop experiments, then your reported significance levels are garbage.

If you really want to do this stuff right: Fixing a sample size in advance can be frustrating. What if your change is a runaway hit, shouldn’t you deploy it immediately? This problem has haunted the medical world for a long time, since medical researchers often want to stop clinical trials as soon as a new treatment looks effective, but they also need to make valid statistical inferences on their data. Here are a couple of approaches used in medical experiment design that someone really ought to adapt to the web:
  • Sequential experiment design: Sequential experiment design lets you set up checkpoints in advance at which you will decide whether or not to continue the experiment, and it gives you the correct significance levels. (A simulation sketch of the checkpoint idea follows this list.)
  • Bayesian experiment design: With Bayesian experiment design you can stop your experiment at any time and make perfectly valid inferences. Given the real-time nature of web experiments, Bayesian design seems like the way forward.
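To make the checkpoint idea concrete, here is a minimal Monte Carlo sketch, not a production group-sequential procedure: it calibrates a single per-look z threshold, in the spirit of Pocock's design, so that the overall false-positive rate across all looks stays at 5%. The number of looks, per-look sample size, and trial count are arbitrary illustrations:

    # Calibrate one constant z threshold for K interim looks so that the
    # probability of ever crossing it under the null is alpha (Pocock-style).
    import numpy as np

    def calibrated_threshold(looks=5, n_per_look=200, alpha=0.05,
                             trials=50_000, seed=7):
        rng = np.random.default_rng(seed)
        # Null-hypothesis increments of an accumulating test statistic
        increments = rng.standard_normal((trials, looks)) * np.sqrt(n_per_look)
        cum_sums = np.cumsum(increments, axis=1)
        n_cum = n_per_look * np.arange(1, looks + 1)
        z_paths = cum_sums / np.sqrt(n_cum)    # z-statistic at each checkpoint
        max_abs_z = np.abs(z_paths).max(axis=1)
        return np.quantile(max_abs_z, 1 - alpha)

    print(calibrated_threshold())   # about 2.41 for five looks, vs. 1.96 for one

With five evenly spaced looks the calibrated threshold comes out near 2.41, close to the constant Pocock published for this case, which is exactly why naively reusing 1.96 at every peek inflates the false-positive rate.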

Conclusion

Although they seem powerful and convenient, dashboard views of ongoing A/B experiments invite misuse. Any time they are used in conjunction with a manual or automatic “stopping rule,” the resulting significance tests are simply invalid. Until sequential or Bayesian experiment designs are implemented in software, anyone running web experiments should only run experiments where the sample size has been fixed in advance, and stick to that sample size with near-religious discipline.

Further reading

Repeated Significance Tests

P. Armitage, C. K. McPherson, and B. C. Rowe. “Significance Tests on Accumulating Data,” Journal of the Royal Statistical Society. Series A (General), Vol. 132, No. 2 (1969), pp. 235-244

Optimal Sample Sizes

John A. List, Sally Sadoff, and Mathis Wagner. “So you want to run an experiment, now what? Some Simple Rules of Thumb for Optimal Experimental Design.” NBER Working Paper No. 15701
Wheeler, Robert E. “Portable Power,” Technometrics, Vol. 16, No. 2 (May, 1974), pp. 193-201

Sequential Experiment Design

Pocock, Stuart J. “Group Sequential Methods in the Design and Analysis of Clinical Trials,” Biometrika, Vol. 64, No. 2 (Aug., 1977), pp. 191-199
Pocock, Stuart J. “Interim Analyses for Randomized Clinical Trials: The Group Sequential Approach,” Biometrics, Vol. 38, No. 1 (Mar., 1982), pp. 153-162

Bayesian Experiment Design

Berry, Donald A. “Bayesian Statistics and the Efficiency and Ethics of Clinical Trials,” Statistical Science, Vol. 19, No. 1 (Feb., 2004), pp. 175-187

