WebResponse HTTPS GetResponseStream Encoded
I'm just trying to get the HTML of an HTTPS web page, but what comes back
is question marks and other junk characters. This is my main method:
Public Function PostPage(ByVal URL As String, ByVal enc As Encoding) As String
    Try
        ServicePointManager.ServerCertificateValidationCallback =
            New RemoteCertificateValidationCallback(AddressOf ValidateCertificate)
        Dim htmlRequest As HttpWebRequest =
            DirectCast(WebRequest.Create(URL), HttpWebRequest)
        Dim htmlResponse As HttpWebResponse =
            DirectCast(htmlRequest.GetResponse(), HttpWebResponse)
        Return New System.IO.StreamReader(
            htmlResponse.GetResponseStream(), enc).ReadToEnd()
    Catch ex As Exception
        Console.WriteLine("Error: " & ex.Message)
    End Try
    Return ""
End Function
You might notice I am bypassing certificate validation, and that my
encoding is parameterized.
Sometimes I include other headers like Accept-Encoding: gzip, deflate, and
User-Agent, etc. But the main thing here is how I call this function. I use
the following:
Sub LearnEncoding(ByVal MyURL As String)
    Dim dctResults As New Dictionary(Of String, String)
    For Each objEncoding In System.Text.Encoding.GetEncodings
        If dctResults.ContainsKey(objEncoding.DisplayName) = False Then
            Dim MySpider As New clsWebSpider
            dctResults.Add(objEncoding.DisplayName,
                           MySpider.PostPage(MyURL, objEncoding.GetEncoding))
        End If
    Next
End Sub
So I try every encoding in the framework (139 of them), and the Dictionary
gives me a quick glance at the result of every attempt. Most are different
from each other, but all are junk.
However, when I run this and see the results in Fiddler, it's perfect
HTML. So I'm getting the response back correctly, I just don't know how to
decode the HTML.
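One thing worth ruling out here, since the request sometimes carries Accept-Encoding: gzip, deflate: HttpWebRequest does not decompress the body unless AutomaticDecompression is set, so a compressed response read through a StreamReader looks like junk in every text encoding. The sketch below is a minimal variant of the request setup under that assumption. The name FetchDecompressed is hypothetical, and whether compression is really the cause for this particular site is not confirmed.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

Module DecompressionSketch
    ' Hypothetical helper: same idea as PostPage, but asks the framework
    ' to undo gzip/deflate compression before the text is decoded.
    Public Function FetchDecompressed(ByVal url As String, ByVal enc As Encoding) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)

        ' With this set, the response stream yields decompressed bytes.
        ' Do not also add an Accept-Encoding header by hand.
        request.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate

        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream(), enc)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

If this returns readable HTML with Encoding.UTF8, the junk was compressed data rather than an unknown character set.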
Could it have something to do with the certificate? ValidateCertificate
just returns true:
Public Function ValidateCertificate(ByVal sender As Object,
                                    ByVal certificate As X509Certificate,
                                    ByVal chain As X509Chain,
                                    ByVal sslPolicyErrors As SslPolicyErrors) As Boolean
    Return True
End Function
I also tried setting the encoding to GetEncoding(htmlResponse.CharacterSet).
Could it be an encoding that I haven't heard of? How would I find out which
one? Like I said, IE, Chrome, FF, Fiddler, etc. all decode it correctly, but
I don't know how to see what encoding they are using to get the HTML. The
charset in the headers and the meta tags of the response both say UTF-8,
but that returns symbols too.
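To see what the server actually sent rather than guessing, it may help to look at the first raw bytes of the body before any text decoding. This is a rough diagnostic sketch, not a fix. DescribeBody is a hypothetical helper, and it relies only on two well-known signatures: a gzip stream starts with the bytes 1F 8B, and a UTF-8 byte order mark is EF BB BF.

```vbnet
Imports System

Module BodyDiagnostics
    ' Hypothetical helper: guesses what a raw response body is
    ' from its leading bytes, before any Encoding is applied.
    Public Function DescribeBody(ByVal body As Byte()) As String
        If body Is Nothing OrElse body.Length < 2 Then
            Return "too short to tell"
        End If

        ' gzip magic number
        If body(0) = &H1F AndAlso body(1) = &H8B Then
            Return "gzip-compressed"
        End If

        ' UTF-8 byte order mark
        If body.Length >= 3 AndAlso body(0) = &HEF AndAlso
           body(1) = &HBB AndAlso body(2) = &HBF Then
            Return "UTF-8 text with BOM"
        End If

        Return "no known signature"
    End Function
End Module
```

The bytes could be obtained by copying htmlResponse.GetResponseStream() into a MemoryStream and calling ToArray(). The htmlResponse.ContentEncoding property reports the Content-Encoding header, which is separate from the charset in CharacterSet.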